iTransformer: Inverted Transformers Are Effective for Time Series Forecasting
TL;DR Summary
iTransformer inverts Transformer dimensions to embed time points into variate tokens, enhancing multivariate correlation modeling and nonlinear representation, addressing large-lookback-window challenges, and achieving state-of-the-art time series forecasting performance and generalization.
Abstract
The recent boom of linear forecasting models questions the ongoing passion for architectural modifications of Transformer-based forecasters. These forecasters leverage Transformers to model the global dependencies over temporal tokens of time series, with each token formed by multiple variates of the same timestamp. However, Transformers are challenged in forecasting series with larger lookback windows due to performance degradation and computation explosion. Besides, the embedding for each temporal token fuses multiple variates that represent potential delayed events and distinct physical measurements, which may fail in learning variate-centric representations and result in meaningless attention maps. In this work, we reflect on the competent duties of Transformer components and repurpose the Transformer architecture without any modification to the basic components. We propose iTransformer that simply applies the attention and feed-forward network on the inverted dimensions. Specifically, the time points of individual series are embedded into variate tokens which are utilized by the attention mechanism to capture multivariate correlations; meanwhile, the feed-forward network is applied for each variate token to learn nonlinear representations. The iTransformer model achieves state-of-the-art on challenging real-world datasets, which further empowers the Transformer family with promoted performance, generalization ability across different variates, and better utilization of arbitrary lookback windows, making it a nice alternative as the fundamental backbone of time series forecasting. Code is available at this repository: https://github.com/thuml/iTransformer.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
iTransformer: Inverted Transformers Are Effective for Time Series Forecasting
1.2. Authors
Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, Mingsheng Long. Affiliations include the School of Software, BNRist, Tsinghua University, Beijing, China, and Ant Group, Hangzhou, China. The authors are primarily affiliated with Tsinghua University, a prominent research institution, with some also associated with Ant Group, a major technology company, suggesting a blend of academic rigor and industrial relevance in their research.
1.3. Journal/Conference
The paper was published on arXiv, a preprint server, on October 10, 2023. arXiv is widely used in the academic community for disseminating research rapidly before peer review or formal publication. While not a peer-reviewed journal or conference in itself, papers on arXiv often precede submissions to prestigious venues.
1.4. Publication Year
2023
1.5. Abstract
The paper addresses the challenges faced by traditional Transformer-based time series forecasters, which often model global dependencies over temporal tokens (multiple variates at the same timestamp). These challenges include performance degradation and computational explosion with larger lookback windows, and difficulties in learning variate-centric representations due to the fusion of distinct measurements within temporal tokens. The authors propose iTransformer, which repurposes the Transformer architecture by inverting the dimensions: attention and the feed-forward network (FFN) are applied on the inverted dimensions. Specifically, the time points of individual series are embedded into variate tokens, allowing the attention mechanism to capture multivariate correlations, while the FFN processes each variate token to learn nonlinear representations. iTransformer achieves state-of-the-art performance on real-world datasets, demonstrating promoted performance, generalization ability across different variates, and better utilization of arbitrary lookback windows, positioning it as a fundamental backbone for time series forecasting.
1.6. Original Source Link
https://arxiv.org/abs/2310.06625 The publication status is a preprint on arXiv. PDF Link: https://arxiv.org/pdf/2310.06625v4.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the inefficiency and performance limitations of existing Transformer-based models when applied to multivariate time series forecasting. Traditional Transformer models, drawing inspiration from natural language processing, often treat each timestamp as a "token," embedding multiple variates (different measurements or features) from that single time point into one combined representation. This approach has several critical drawbacks in time series forecasting:
- Performance Degradation with Large Lookback Windows: As the historical data (lookback window) increases, Transformer performance can degrade, and computational costs (especially for self-attention) explode quadratically.
- Meaningless Attention Maps: Fusing multiple variates (which might represent delayed events or distinct physical measurements) into a single temporal token can lead to meaningless attention maps, as the model struggles to learn variate-centric representations. The attention mechanism might not effectively capture true dependencies when tokens are not semantically cohesive.
- Competition from Linear Models: The recent success of simpler linear forecasting models (like DLinear or TiDE), which often outperform complex Transformer architectures, further questions the efficacy of current Transformer adaptations for time series. This suggests that the issue might not be with the Transformer's core components but with how they are applied to time series data.

The paper's entry point, or innovative idea, is to re-evaluate the fundamental application of Transformer components to time series by inverting the dimensions. Instead of processing time-aligned (temporal) tokens, iTransformer processes variate-aligned tokens, essentially treating each individual time series (corresponding to one variate) as a token. This inverted perspective aims to address the aforementioned issues by better aligning the Transformer's strengths (capturing dependencies) with the inherent structure of multivariate time series data.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Repurposed Transformer Architecture (iTransformer): It proposes iTransformer, a novel framework that applies native Transformer components (self-attention, feed-forward networks, layer normalization) on inverted dimensions. Instead of temporal tokens, it treats the independent time series of each variate as tokens. This is a fundamental architectural shift without modifying the core Transformer components themselves.
- Enhanced Multivariate Correlation Capture: By embedding independent time series as variate tokens, the self-attention mechanism is naturally repurposed to capture multivariate correlations (dependencies between different variates) across their entire historical sequences, leading to more interpretable attention maps.
- Improved Series Representation Learning: The feed-forward network (FFN) is applied to each variate token independently, allowing it to learn series-global representations for individual time series, which are then used for forecasting future steps. This is argued to be more effective than FFNs on temporal tokens, which are often too localized.
- State-of-the-Art Performance: iTransformer achieves comprehensive state-of-the-art performance on challenging real-world datasets, outperforming many advanced Transformer-based and linear forecasting models.
- Promoted Performance and Generalization: The inverted framework consistently improves the performance of various Transformer variants (e.g., Transformer, Reformer, Informer, Flowformer, Flashformer). It also demonstrates strong generalization ability on unseen variates and effectively utilizes arbitrary lookback windows for more precise predictions, addressing key pain points of previous Transformer-based forecasters.

These findings collectively address performance degradation with larger lookback windows, computational explosion for vanilla Transformers, and the generation of meaningless attention maps, by providing a more appropriate and efficient way to apply Transformer components to multivariate time series data.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand iTransformer and its innovations, a foundational grasp of Transformers and Time Series Forecasting is essential.
3.1.1. Time Series Forecasting
Time series forecasting involves predicting future values of a variable based on its past observations.
- Univariate Time Series: A single sequence of observations recorded over time.
- Multivariate Time Series: Multiple related time series observed simultaneously. For example, in a weather dataset, temperature, humidity, and wind speed are distinct
variates (variables) that form a multivariate time series.
- Lookback Window (Input Length): The length of historical data used to make a prediction, often denoted as T.
- Prediction Length (Horizon): The number of future time steps to be predicted, often denoted as S.
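To make these terms concrete, here is a minimal NumPy sketch (sizes and values are illustrative placeholders, not taken from any of the paper's datasets) showing how a lookback window of length T and a horizon of length S are carved out of a multivariate series with N variates:

```python
import numpy as np

# Toy multivariate series: 1000 time steps of N = 3 variates
# (e.g., temperature, humidity, wind speed). Values are random placeholders.
series = np.random.randn(1000, 3)        # shape (time steps, variates)

T, S = 96, 96                            # lookback length and prediction horizon
lookback = series[:T]                    # model input,    shape (T, N) = (96, 3)
horizon = series[T:T + S]                # forecast target, shape (S, N) = (96, 3)

print(lookback.shape, horizon.shape)     # (96, 3) (96, 3)
```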
3.1.2. Transformer Architecture
The Transformer (Vaswani et al., 2017) is a neural network architecture primarily based on the self-attention mechanism. It was originally developed for natural language processing (NLP) tasks but has since been adapted for various domains, including computer vision and time series analysis.
The vanilla Transformer typically consists of an encoder and a decoder. iTransformer uses an encoder-only architecture. Key components of a Transformer encoder block include:
- Tokens and Embeddings: Input sequences (e.g., words in NLP, time points in time series) are first converted into numerical representations called embeddings. These embeddings are then treated as tokens that the Transformer processes.
- Positional Encoding: Since Transformers process input tokens in parallel without inherent sequence order, positional encodings are added to the embeddings to inject information about the relative or absolute position of tokens in the sequence. This is crucial for tasks where order matters, like time series.
- Multi-Head Self-Attention (MHSA): This is the core mechanism that allows the Transformer to weigh the importance of different tokens in the input sequence when processing each token.
  - For each token, three vectors are created: Query ($Q$), Key ($K$), and Value ($V$). These are typically derived by linearly transforming the token's embedding.
  - The attention score between a Query and a Key represents how much the Query should "pay attention" to the Key. These scores are calculated for all Query-Key pairs.
  - The scores are then scaled by $\frac{1}{\sqrt{d_k}}$ (where $d_k$ is the dimension of the Key vectors) and passed through a softmax function to obtain attention weights.
  - Finally, these weights are used to compute a weighted sum of the Value vectors, producing the output of the attention mechanism. Multi-head means this process is repeated multiple times in parallel with different linear transformations, and the results are concatenated and linearly transformed again. This allows the model to capture different types of relationships.
  - The standard formula for Self-Attention is:
    $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
    Where:
    - $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices representing the input transformations.
    - $d_k$ is the dimension of the key vectors.
    - $QK^T$ is the dot product of the Query and Key matrices, forming the raw attention scores.
    - $\mathrm{softmax}$ normalizes the scores to create attention weights.
    - The output is a weighted sum of the Value vectors. (A runnable sketch of a full encoder block follows this list.)
- Feed-Forward Network (FFN): After the attention sub-layer, each token's output is passed through an identical, position-wise feed-forward network. This typically consists of two linear transformations with a ReLU activation in between. It processes each position (token) independently.
- Layer Normalization (LayerNorm): Applied after each sub-layer (attention and FFN), LayerNorm normalizes the activations across the feature dimension for each input sample independently. This helps stabilize training and speed up convergence.
  - The LayerNorm operation for a vector $x$ is:
    $ \mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta $
    Where:
    - $\mu$ is the mean of the elements in $x$.
    - $\sigma^2$ is the variance of the elements in $x$.
    - $\gamma$ and $\beta$ are learnable scale and shift parameters.
    - $\epsilon$ is a small constant for numerical stability.
- Residual Connections: Residual connections (or skip connections) are used around each sub-layer, followed by LayerNorm. This helps with training deeper networks by allowing gradients to flow more easily. The output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$.
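The following is a minimal, runnable sketch of one such encoder block, assuming single-head attention, a two-layer ReLU FFN, and the post-norm residual scheme described above; the dimensions and class name are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderBlock(nn.Module):
    """One Transformer encoder block: Attention -> Add&Norm -> FFN -> Add&Norm."""
    def __init__(self, d_model=64, d_k=64, d_ff=128):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k)
        self.w_k = nn.Linear(d_model, d_k)
        self.w_v = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, tokens, d_model)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
        attn = F.softmax(scores, dim=-1) @ v   # Attention(Q, K, V)
        x = self.norm1(x + attn)               # residual connection + LayerNorm
        x = self.norm2(x + self.ffn(x))        # residual connection + LayerNorm
        return x

tokens = torch.randn(2, 10, 64)                # 2 sequences of 10 token embeddings
print(EncoderBlock()(tokens).shape)            # torch.Size([2, 10, 64])
```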
3.1.3. Multi-Layer Perceptron (MLP)
An MLP is a class of feedforward artificial neural networks. It consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Each node (neuron) uses a nonlinear activation function, except for the input nodes. MLPs are capable of learning complex non-linear relationships. In iTransformer, MLPs are used for embedding and projection layers.
3.2. Previous Works
The paper categorizes previous Transformer-based forecasters into four categories based on modifications to components and architecture (Figure 3).
The following figure (Figure 3 from the original paper) shows Transformer-based forecasters categorized by component and architecture modifications:
The figure is a chart categorizing Transformer-based time series forecasting models by component and architecture modifications. A two-dimensional layout separates component modifications from architectural modifications, highlighting where iTransformer sits in this taxonomy.
- Component Adaptation (e.g., attention module modification): This category, including works like Autoformer (Wu et al., 2021), Informer (Li et al., 2021), and FEDformer (Zhou et al., 2022), primarily focuses on modifying the attention module to improve efficiency (especially for long sequences) or adapt to temporal dependencies. For instance, Informer uses ProbSparse self-attention to reduce the quadratic complexity, and Autoformer introduces an auto-correlation mechanism for long-term series forecasting.
- Architectural Adaptation (e.g., inherent processing of time series): This category focuses on how time series data is prepared or structured for the Transformer. Examples include Stationarization (Liu et al., 2022b), which tackles non-stationarity, Channel Independence (Nie et al., 2023), and Patching (Nie et al., 2023). Patching involves splitting the input time series into smaller sub-series (patches) and treating these patches as tokens, aiming to enlarge the receptive field and capture local patterns. PatchTST (Nie et al., 2023) is a prominent example.
- Both Component and Architectural Modification: Crossformer (Zhang & Yan, 2023) falls into this category, explicitly capturing cross-time and cross-variate dependencies through renovated attention mechanisms and architecture.
- No Component Modification, only Architectural: iTransformer is presented as the first work in this category, keeping native Transformer components but inverting the dimensions they operate on.

The paper also highlights the boom of linear forecasting models like N-BEATS (Oreshkin et al., 2019), DLinear (Zeng et al., 2023), TiDE (Das et al., 2023), and RLinear (Li et al., 2023). These models demonstrate that simple linear layers can achieve competitive or even superior performance and efficiency compared to complex Transformer architectures, challenging the "passion for architectural modifications of Transformer-based forecasters."
3.3. Technological Evolution
The field of time series forecasting has seen significant evolution:
- Statistical Methods: Traditionally, methods like ARIMA (Box & Jenkins, 1968) and Exponential Smoothing dominated. These rely on statistical properties and often assume linearity or specific data distributions.
- Recurrent Neural Networks (RNNs) and Temporal Convolutional Networks (TCNs): With the rise of deep learning, RNNs (like LSTMs, Zhao et al., 2017, and DeepAR, Salinas et al., 2020) and TCNs (Bai et al., 2018; SCINet, Liu et al., 2022a) became popular for their ability to model sequential data.
- Transformers: Inspired by their success in NLP, Transformers were adapted for time series (Wu et al., 2021; Nie et al., 2023). Early adaptations often treated time points as tokens, similar to words in a sentence. However, this direct application revealed limitations, especially in multivariate settings where variates at the same timestamp might not be semantically related or may exhibit phase shifts.
- Linear Models Resurgence: The recent strong performance of linear models has caused researchers to question the necessity and effectiveness of complex deep learning architectures for time series, especially the way Transformers were being applied.
- Inverted Transformers (iTransformer): This paper fits into the latest wave of evolution by suggesting that the Transformer's core components are powerful, but their application to time series needs a fundamental re-thinking of tokenization and dimension. By inverting the tokenization from temporal to variate, iTransformer attempts to better align the Transformer's strengths with the intrinsic characteristics of multivariate time series.
3.4. Differentiation Analysis
iTransformer differentiates itself from other Transformer-based time series forecasters primarily by its novel tokenization strategy and the inversion of dimensions for Transformer components, while crucially not modifying the basic components themselves.
The following figure (Figure 2 from the original paper) shows a comparison between the vanilla Transformer (top) and the proposed iTransformer (bottom):
This schematic from the paper compares how the vanilla Transformer and the proposed iTransformer encode time series. The vanilla Transformer embeds multiple variates at each time step as a temporal token and attends to temporal dependencies; iTransformer inverts the dimensions, embedding each variate as a token and attending to multivariate correlations.
Here's a breakdown of the core differences and innovations compared to main methods:
- Traditional Transformer-based Forecasters (e.g., Autoformer, Informer, FEDformer, vanilla Transformer adaptations):
  - Tokenization: These models typically embed multiple variates of the same timestamp into a single temporal token. Attention is then applied along the temporal dimension to capture time-wise dependencies.
  - Problems: As discussed in the paper, this can lead to malpositioned and too-localized representations, meaningless attention maps (especially when variates at a given timestamp are unrelated or have phase lags), performance degradation with large lookback windows, and computational explosion.
  - Differentiation: iTransformer explicitly rejects this temporal tokenization. It views the entire time series of each individual variate as a token (variate token). Attention is then applied across the variate dimension to capture relationships between different variates, while the feed-forward network (FFN) operates on the temporal dimension of each variate token to learn its series representation.
- PatchTST (Nie et al., 2023):
  - Tokenization: PatchTST uses patching, where segments (patches) of a time series are treated as tokens. This is still primarily a temporal tokenization strategy, albeit with a broader receptive field than single time points.
  - Differentiation: While PatchTST improves upon vanilla Transformers by using patches, iTransformer's variate tokenization is a more fundamental shift. iTransformer argues that PatchTST's patching mechanism might "lose focus on specific locality to handle rapid fluctuation" and can still introduce "interaction noises between time-unaligned patches from different multivariate" series. iTransformer's approach of treating a whole series as a token and using attention for multivariate correlation is distinct.
- Crossformer (Zhang & Yan, 2023):
  - Tokenization/Mechanism: Crossformer explicitly aims to capture both cross-time and cross-variate dependencies, often through modified attention mechanisms and architecture.
  - Differentiation: iTransformer argues that Crossformer's explicit dual-dependency capture can still suffer from the "interaction of time-unaligned patches from different multivariate" series, leading to "unnecessary noise." iTransformer provides a cleaner separation of duties: self-attention for multivariate correlations and the FFN for temporal representations within each variate.
- Linear-based models (e.g., DLinear, TiDE, RLinear):
  - Architecture: These are typically much simpler models, often relying on linear layers or MLPs for forecasting. They highlight the effectiveness of simple linear mappings for temporal dependencies.
  - Differentiation: iTransformer acknowledges the power of linear models, particularly for temporal dependencies. It integrates this insight by having the feed-forward network (an MLP-like component) process the temporal dimension of each variate token. This effectively leverages the temporal modeling capabilities of linear layers while retaining the multivariate correlation power of attention on the variate dimension.

In summary, iTransformer's core innovation lies in its inverted perspective, which fundamentally re-thinks the tokenization and dimension assignments for Transformer components in time series forecasting, leading to a more natural and effective alignment with the data's inherent structure.
4. Methodology
4.1. Principles
The core principle behind iTransformer is the inversion of how Transformer components (specifically self-attention and the feed-forward network) are applied to multivariate time series data, without modifying the intrinsic design of these components. The authors argue that previous Transformer-based forecasters incorrectly apply attention along the temporal dimension on temporal tokens (which combine all variates at a single timestamp). This approach leads to issues like meaningless attention maps due to heterogeneous variates and degraded performance with long sequences.
iTransformer's intuition is that:
- Variate-centric Representations: Each individual time series (corresponding to a single variate) should be treated as a coherent unit, or token. This allows for learning variate-centric representations that respect the distinct physical meanings and statistical distributions of each series.
- Multivariate Correlations: The self-attention mechanism, which excels at capturing pairwise dependencies, should be applied across these variate tokens to model multivariate correlations (i.e., how different variates interact with each other over time). This provides a more interpretable and meaningful attention map.
- Temporal Representations: The feed-forward network (FFN), which is proficient at learning nonlinear representations from individual inputs, should be applied to the temporal dimension of each variate token to extract complex patterns and representations for that specific time series. This leverages the strengths of MLP-based models for temporal modeling.
- Layer Normalization for Discrepancies: Layer normalization should be applied to the series representation of individual variates to reduce discrepancies caused by inconsistent measurements and to tackle non-stationarity.

By adopting this inverted view, iTransformer aims to build a more robust, generalizable, and performant Transformer backbone for multivariate time series forecasting.
4.2. Core Methodology In-depth (Layer by Layer)
iTransformer adopts an encoder-only architecture, similar to the encoder of a vanilla Transformer, but with a crucial inversion of the dimensions that its core modules operate on.
The following figure (Figure 4 from the original paper) shows the overall structure of iTransformer:
This structural diagram (Figure 4) shows the overall architecture of iTransformer: (a) the raw multivariate series is embedded into variate tokens, (b) self-attention over variate tokens computes the multivariate correlation map, (c) a shared feed-forward network extracts series representations, and (d) layer normalization reduces discrepancies among variates.
Given historical observations of a multivariate time series $\mathbf{X} \in \mathbb{R}^{T \times N}$, where $T$ is the number of time steps (lookback length) and $N$ is the number of variates, the goal is to predict the future $S$ time steps $\mathbf{Y} \in \mathbb{R}^{S \times N}$.
The overall process for iTransformer to predict the future series $\hat{\mathbf{Y}}_{:,n}$ of a specific variate $n$ based on its lookback series $\mathbf{X}_{:,n}$ is formulated as follows:
$
\begin{array}{r}
\mathbf{h}_{n}^{0} = \operatorname{Embedding}(\mathbf{X}_{:,n}), \\
\mathbf{H}^{l+1} = \operatorname{TrmBlock}(\mathbf{H}^{l}), \quad l = 0, \dots, L-1, \\
\hat{\mathbf{Y}}_{:,n} = \operatorname{Projection}(\mathbf{h}_{n}^{L}),
\end{array}
$
Where:
- $\mathbf{X}_{:,n} \in \mathbb{R}^{T}$ represents the entire time series for the $n$-th variate over the lookback window of length $T$.
- $\operatorname{Embedding}: \mathbb{R}^{T} \mapsto \mathbb{R}^{D}$ is a function that transforms the raw time series of a variate into a higher-dimensional variate token representation.
- $\mathbf{h}_{n}^{0} \in \mathbb{R}^{D}$ is the initial embedded variate token for the $n$-th variate, where $D$ is the dimension of the token (model dimension).
- $\mathbf{H} = \{\mathbf{h}_{1}, \dots, \mathbf{h}_{N}\} \in \mathbb{R}^{N \times D}$ is the collection of all embedded variate tokens.
- $\operatorname{TrmBlock}$ represents a single iTransformer block, which processes the variate tokens.
- $l$ denotes the layer index, ranging from $0$ to $L-1$, where $L$ is the total number of iTransformer blocks.
- $\mathbf{h}_{n}^{L}$ is the final representation of the $n$-th variate token after passing through $L$ blocks.
- $\operatorname{Projection}: \mathbb{R}^{D} \mapsto \mathbb{R}^{S}$ is a function that transforms the final variate token representation back into the predicted future series for that variate.
- $\hat{\mathbf{Y}}_{:,n} \in \mathbb{R}^{S}$ is the predicted future series for the $n$-th variate over the prediction length $S$.
Let's break down each component:
4.2.1. Embedding Layer
- Input: The raw time series for each individual variate. For a specific variate $n$, this is $\mathbf{X}_{:,n} \in \mathbb{R}^{T}$.
- Process: This layer independently embeds each variate's time series into a high-dimensional variate token.
- Implementation: It is implemented by a multi-layer perceptron (MLP), which can learn complex nonlinear representations from the entire sequence of time points of a single variate.
- Output: $\mathbf{h}_{n}^{0} = \operatorname{Embedding}(\mathbf{X}_{:,n})$, the initial variate token of dimension $D$ for the $n$-th variate.
- Role: This step transforms each individual time series into a semantically rich variate token that aggregates a global representation of the series, ensuring variate-centricity from the start.
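As a minimal sketch, assuming the simplest possible MLP (a single linear layer) and illustrative dimensions, the embedding amounts to applying a linear map along the temporal dimension of each variate:

```python
import torch
import torch.nn as nn

B, T, N, D = 32, 96, 7, 512                 # batch, lookback, variates, token dim
x = torch.randn(B, T, N)                    # raw multivariate lookback series

embed = nn.Linear(T, D)                     # maps each length-T series to a D-dim token
variate_tokens = embed(x.transpose(1, 2))   # (B, T, N) -> (B, N, T) -> (B, N, D)
print(variate_tokens.shape)                 # torch.Size([32, 7, 512])
```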
4.2.2. iTransformer Block (TrmBlock)
Each iTransformer block consists of Layer Normalization, Self-Attention, and a Feed-Forward Network, arranged as in a standard Transformer encoder block, with LayerNorm applied after each residual connection (post-norm, matching Algorithm 1 below). These components, however, operate on the inverted dimensions.
4.2.2.1. Layer Normalization
- Role in iTransformer: Unlike traditional Transformer-based forecasters, where LayerNorm might normalize across variates at a single timestamp, in iTransformer LayerNorm is applied to the series representation of individual variates. This is crucial for handling non-stationarity and reducing discrepancies due to inconsistent measurements across different variates. By normalizing each variate token toward a Gaussian distribution, it helps ensure that the self-attention mechanism is not unduly influenced by scale differences between variates.
- Formula: For a collection of variate tokens $\mathbf{H} = \{\mathbf{h}_{1}, \dots, \mathbf{h}_{N}\}$, the LayerNorm operation is applied to each token independently:
  $ \mathrm{LayerNorm}(\mathbf{H}) = \left\{ \frac{\mathbf{h}_{n} - \mathrm{Mean}(\mathbf{h}_{n})}{\sqrt{\mathrm{Var}(\mathbf{h}_{n}) + \epsilon}} \ \bigg| \ n = 1, \ldots, N \right\} $
  Where:
  - $\mathbf{h}_{n}$ is the variate token (series representation) for the $n$-th variate.
  - $\mathrm{Mean}(\mathbf{h}_{n})$ is the mean of the elements in the vector $\mathbf{h}_{n}$.
  - $\mathrm{Var}(\mathbf{h}_{n})$ is the variance of the elements in the vector $\mathbf{h}_{n}$.
  - $\epsilon$ is a small constant added for numerical stability.
  - The output is a set of normalized variate tokens.
4.2.2.2. Self-Attention
- Role in iTransformer: In iTransformer, self-attention is applied across the variate dimension to capture multivariate correlations. Each variate token acts as an input to the attention mechanism.
- Input: The set of normalized variate tokens $\mathbf{H} \in \mathbb{R}^{N \times D}$.
- Process: Queries, Keys, and Values are generated from $\mathbf{H}$ via linear projections. The attention mechanism then calculates pairwise similarity between variate tokens to determine their interdependencies.
- Formula (standard self-attention applied on variate tokens):
$
\mathrm{Attention}(\mathbf{H}) = \mathrm{softmax}\left(\frac{(\mathbf{H}W_Q)(\mathbf{H}W_K)^T}{\sqrt{d_k}}\right)(\mathbf{H}W_V)
$
Where:
  - $\mathbf{H} \in \mathbb{R}^{N \times D}$ is the matrix of variate tokens from the previous layer.
  - $W_Q, W_K, W_V$ are learnable weight matrices used to project $\mathbf{H}$ into the Query, Key, and Value matrices respectively.
  - $N$ is the number of variate tokens.
  - $d_k$ is the dimension of the projected Query and Key vectors.
  - The resulting attention map (before softmax) has entries $\mathbf{q}_i \mathbf{k}_j^{T}$, where $\mathbf{q}_i$ and $\mathbf{k}_j$ are the Query and Key vectors for variates $i$ and $j$. Each entry directly reflects the relationship between variate $i$ and variate $j$.
- Output: The updated variate tokens after self-attention, typically combined with a residual connection and LayerNorm.
- Enhanced Interpretability: This design means that the attention weights directly reflect the correlations between different variates, making the attention map highly interpretable for understanding multivariate correlations (see the sketch after this list).
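A minimal single-head sketch (illustrative dimensions, not the official multi-head implementation) makes the inversion visible: with variate tokens as the attention inputs, the score map is N x N and can be read as pairwise variate relationships.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, D, d_k = 7, 512, 64                      # variates, token dim, query/key dim
H = torch.randn(N, D)                       # one sample's variate tokens

W_q, W_k, W_v = nn.Linear(D, d_k), nn.Linear(D, d_k), nn.Linear(D, D)
scores = W_q(H) @ W_k(H).T / d_k ** 0.5     # (N, N): entry (i, j) relates variate i to j
weights = F.softmax(scores, dim=-1)         # each row sums to 1 over the N variates
updated = weights @ W_v(H)                  # updated variate tokens, shape (N, D)
print(scores.shape, updated.shape)          # torch.Size([7, 7]) torch.Size([7, 512])
```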
4.2.2.3. Feed-Forward Network (FFN)
- Role in iTransformer: The FFN is applied independently to each variate token, effectively operating on the temporal information of that variate (implicitly, since $\mathbf{h}_{n}$ encodes the entire time series of variate $n$).
- Input: The output of the self-attention sub-layer (after the residual connection and LayerNorm) for each variate token.
- Process: Each variate token passes through an identical FFN, which typically consists of two linear transformations with an activation function (e.g., ReLU) in between.
- Implementation: The FFN is shared across all variate tokens. This means the same FFN learns intrinsic patterns (like amplitude, periodicity, and frequency spectrums) that are generalizable across different time series.
- Output: Enhanced variate tokens that have learned richer nonlinear representations of their respective time series.
- No Positional Encoding: Because each variate token already encodes the entire time series of its corresponding variate (implicitly capturing temporal order within its representation), and the FFN works identically on independent variates, explicit positional encoding (as used in vanilla Transformers for temporal tokens) is not needed for iTransformer.
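A short sketch of this shared FFN (hidden width and activation are illustrative assumptions): because it is applied to the last dimension, the same two linear layers are broadcast over every variate token.

```python
import torch
import torch.nn as nn

B, N, D, d_ff = 32, 7, 512, 1024
H = torch.randn(B, N, D)                    # variate tokens after self-attention

ffn = nn.Sequential(nn.Linear(D, d_ff), nn.ReLU(), nn.Linear(d_ff, D))
out = ffn(H)                                # same weights applied to each of the N tokens
print(out.shape)                            # torch.Size([32, 7, 512])
```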
4.2.3. Projection Layer
- Input: The final variate tokens $\mathbf{h}_{n}^{L}$ after $L$ iTransformer blocks.
- Process: This layer transforms each final variate token into its corresponding predicted future time series.
- Implementation: It is implemented by an MLP.
- Output: $\hat{\mathbf{Y}}_{:,n} \in \mathbb{R}^{S}$, the predicted $S$ future time steps for the $n$-th variate.
- Role: This step converts the learned variate-centric representations into concrete future forecasts. The fact that a simple MLP is sufficient here is consistent with the findings of successful linear forecasters.
4.2.4. Algorithm Overview
The pseudo-code for iTransformer summarizes these steps:
The following is Algorithm 1 from the original paper (statement | comment):
| Algorithm 1 iTransformer - Overall Architecture. | |
| Require: Input lookback time series X ∈ R^{T×N}; input length T; predicted length S; variate number N; token dimension D; iTransformer block number L. | |
| 1: X = X.transpose() | ▷ X ∈ R^{N×T} |
| 2: | ▷ A multi-layer perceptron works on the last dimension to embed series into variate tokens. |
| 3: H^0 = MLP(X) | ▷ H^0 ∈ R^{N×D} |
| 4: for l in {1, ..., L}: | ▷ Run through the L iTransformer blocks. |
| 5: | ▷ The self-attention layer is applied on variate tokens. |
| 6: H^{l-1} = LayerNorm(H^{l-1} + Self-Attn(H^{l-1})) | |
| 7: | ▷ The feed-forward network is utilized for series representations, broadcasting to each token. |
| 8: H^l = LayerNorm(H^{l-1} + Feed-Forward(H^{l-1})) | ▷ H^l ∈ R^{N×D} |
| 9: | ▷ LayerNorm is adopted on series representations to reduce variate discrepancies. |
| 10: End for | |
| 11: Ŷ = MLP(H^L) | ▷ Project tokens back to the predicted series, Ŷ ∈ R^{N×S}. |
| 12: Ŷ = Ŷ.transpose() | ▷ Ŷ ∈ R^{S×N} |
| 13: Return Ŷ | ▷ Return the prediction result. |
Step-by-step Explanation of Algorithm 1:
- Input Data Transposition (Line 1):
  - The input lookback time series $\mathbf{X} \in \mathbb{R}^{T \times N}$ is initially in shape (time steps × variates).
  - It is immediately transposed to $\mathbf{X} \in \mathbb{R}^{N \times T}$. This reorients the data such that each row now represents an individual variate's entire historical time series, which is crucial for variate-centric tokenization.
- Embedding Layer (Lines 2-3):
  - A multi-layer perceptron (MLP) is applied to the transposed input. It operates on the last dimension (of size $T$, the length of each variate's time series).
  - For each variate's time series (a row of the transposed $\mathbf{X}$), the MLP generates an initial variate token.
  - The output $\mathbf{H}^{0}$ is a matrix of shape $N \times D$, where each row is a variate token of dimension $D$.
- iTransformer Blocks (Lines 4-9):
  - The algorithm then iterates through the $L$ iTransformer blocks (from $l = 1$ to $L$). In each iteration, $\mathbf{H}^{l-1}$ is the input to the current block and $\mathbf{H}^{l}$ is the output.
  - Self-Attention Sub-layer (Lines 5-6):
    - The self-attention layer is applied to the current set of variate tokens. As explained, this attention mechanism captures multivariate correlations by computing relationships between different variate tokens.
    - A residual connection is added ($\mathbf{H}^{l-1} + \operatorname{Self-Attn}(\mathbf{H}^{l-1})$), followed by layer normalization. The result is written back into $\mathbf{H}^{l-1}$; this in-place notation simply indicates that the intermediate representation is updated before the FFN stage.
  - Feed-Forward Network Sub-layer (Lines 7-8):
    - The feed-forward network (FFN) is then applied. This FFN processes each variate token independently to learn series representations, and it is shared across all tokens.
    - Again, a residual connection is added ($\mathbf{H}^{l-1} + \operatorname{Feed-Forward}(\mathbf{H}^{l-1})$), followed by layer normalization.
    - The final output of the $l$-th block is $\mathbf{H}^{l}$, which becomes the input to the next block.
- Projection Layer (Lines 11-12):
  - After passing through all $L$ iTransformer blocks, the final variate tokens $\mathbf{H}^{L}$ are fed into another MLP.
  - This MLP projects each variate token (of dimension $D$) back into its predicted future series (of length $S$), giving an output of shape $N \times S$.
  - The result is then transposed back from $\mathbb{R}^{N \times S}$ to $\mathbb{R}^{S \times N}$ to match the conventional output format for time series forecasting, where rows are time steps and columns are variates.
- Return (Line 13):
  - The model returns the predicted future time series $\hat{\mathbf{Y}}$.

This detailed breakdown shows how iTransformer systematically re-configures the Transformer's internal processing flow to align with the characteristics of multivariate time series data, leveraging the strengths of its components in an inverted manner. A compact, runnable sketch of this procedure follows.
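To tie the steps together, here is a compact sketch of Algorithm 1 under simplifying assumptions (multi-head attention via nn.MultiheadAttention, a single linear layer standing in for each MLP, post-norm residuals as in the pseudo-code). It mirrors the structure of the algorithm but is not the authors' official implementation; all names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class ITransformerSketch(nn.Module):
    """Minimal sketch of Algorithm 1 (not the official implementation)."""
    def __init__(self, T, S, d_model=256, n_heads=8, d_ff=512, n_blocks=2):
        super().__init__()
        self.embed = nn.Linear(T, d_model)                 # line 3: series -> variate tokens
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "attn":  nn.MultiheadAttention(d_model, n_heads, batch_first=True),
                "norm1": nn.LayerNorm(d_model),
                "ffn":   nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                       nn.Linear(d_ff, d_model)),
                "norm2": nn.LayerNorm(d_model),
            }) for _ in range(n_blocks)
        ])
        self.project = nn.Linear(d_model, S)               # line 11: tokens -> future series

    def forward(self, x):                                  # x: (B, T, N)
        h = self.embed(x.transpose(1, 2))                  # lines 1-3: (B, N, T) -> (B, N, D)
        for blk in self.blocks:                            # lines 4-10
            a, _ = blk["attn"](h, h, h)                    # attention over variate tokens
            h = blk["norm1"](h + a)                        # line 6
            h = blk["norm2"](h + blk["ffn"](h))            # line 8
        y = self.project(h)                                # (B, N, S)
        return y.transpose(1, 2)                           # line 12: (B, S, N)

model = ITransformerSketch(T=96, S=96)
print(model(torch.randn(4, 96, 7)).shape)                  # torch.Size([4, 96, 7])
```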
5. Experimental Setup
5.1. Datasets
The experiments in the paper used a total of 7 real-world public datasets and an additional set of 6 Market datasets.
The following are the results from Table 4 of the original paper:
| Dataset | Dim | Prediction Length | Dataset Size | Frequency | Information |
| ETTh1, ETTh2 | 7 | {96, 192, 336, 720} | (8545, 2881, 2881) | Hourly | Electricity |
| ETTm1, ETTm2 | 7 | {96, 192, 336, 720} | (34465, 11521, 11521) | 15min | Electricity |
| Exchange | 8 | {96, 192, 336, 720} | (5120, 665, 1422) | Daily | Economy |
| Weather | 21 | {96, 192, 336, 720} | (36792, 5271, 10540) | 10min | Weather |
| ECL | 321 | {96, 192, 336, 720} | (18317, 2633, 5261) | Hourly | Electricity |
| Traffic | 862 | {96, 192, 336, 720} | (12185, 1757, 3509) | Hourly | Transportation |
| Solar-Energy | 137 | {96, 192, 336, 720} | (36601, 5161, 10417) | 10min | Energy |
| PEMS03 | 358 | {12, 24, 48, 96} | (15617, 5135, 5135) | 5min | Transportation |
| PEMS04 | 307 | {12, 24, 48, 96} | (10172, 3375, 3375) | 5min | Transportation |
| PEMS07 | 883 | {12, 24, 48, 96} | (16911, 5622, 5622) | 5min | Transportation |
| PEMS08 | 170 | {12, 24, 48, 96} | (10690, 3548, 3548) | 5min | Transportation |
| Market-Merchant | 285 | {12, 24, 72, 144} | (7045, 1429, 1429) | 10min | Transaction |
| Market-Wealth | 485 | {12, 24, 72, 144} | (7045, 1429, 1429) | 10min | Transaction |
| Market-Finance | 405 | {12, 24, 72, 144} | (7045, 1429, 1429) | 10min | Transaction |
| Market-Terminal | 307 | {12, 24, 72, 144} | (7045, 1429, 1429) | 10min | Transaction |
| Market-Payment | 759 | {12, 24, 72, 144} | (7045, 1429, 1429) | 10min | Transaction |
| Market-Customer | 395 | {12, 24, 72, 144} | (7045, 1429, 1429) | 10min | Transaction |
Detailed descriptions:
- ETT (ETTh1, ETTh2, ETTm1, ETTm2): Contains 7 factors related to electricity transformers. ETTh1 and ETTh2 are sampled hourly; ETTm1 and ETTm2 are sampled every 15 minutes. These are commonly used for assessing long-term forecasting capabilities.
- Exchange: Daily exchange rates from 8 countries. Represents economic time series.
- Weather: 21 meteorological factors, sampled every 10 minutes. A typical multivariate natural-phenomenon dataset.
- ECL (Electricity Consuming Load): Hourly electricity consumption data from 321 clients. High-dimensional electricity demand forecasting.
- Traffic: Hourly road occupancy rates from 862 sensors in the San Francisco Bay Area. A very high-dimensional transportation dataset with complex spatial-temporal dependencies.
- Solar-Energy: Solar power production of 137 PV plants, sampled every 10 minutes. Energy forecasting.
- PEMS (PEMS03, PEMS04, PEMS07, PEMS08): Public traffic network data in California, collected in 5-minute windows. Similar to Traffic, but often used for shorter prediction horizons.
- Market (6 subsets): Minute-sampled server load of Alipay online transaction applications, with hundreds of variates (285 to 759). These are custom, high-dimensional, real-world transaction datasets, designed to test model performance in complex industrial settings.

The datasets were chosen to cover a wide range of domains (electricity, economy, weather, transportation, energy, transaction), varying dimensions (from 7 to 883 variates), and different sampling frequencies, making them suitable for evaluating the robustness and generalizability of forecasting models. The inclusion of high-dimensional datasets like Traffic, ECL, PEMS, and Market is particularly important for iTransformer's claims about handling multivariate correlations.
Data processing and train-validation-test splits follow standard chronological division protocols to avoid data leakage. The lookback length T is fixed at 96 for most datasets (except Market), and the prediction length S varies in {12, 24, 48, 96} for PEMS and {96, 192, 336, 720} for the others. For the Market datasets, S varies in {12, 24, 72, 144}.
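A minimal sketch of this protocol, assuming illustrative split ratios (the exact boundaries per dataset are given by the sizes in Table 4, not by these fractions):

```python
import numpy as np

def chronological_split(series, train_frac=0.7, val_frac=0.1):
    """Split along time so training data always precedes validation and test data."""
    t1 = int(len(series) * train_frac)
    t2 = int(len(series) * (train_frac + val_frac))
    return series[:t1], series[t1:t2], series[t2:]

def sliding_windows(split, T=96, S=96):
    """Yield (lookback, horizon) pairs of shapes (T, N) and (S, N)."""
    for start in range(len(split) - T - S + 1):
        yield split[start:start + T], split[start + T:start + T + S]

data = np.random.randn(5000, 7)              # placeholder multivariate series
train, val, test = chronological_split(data)
x, y = next(sliding_windows(train))
print(x.shape, y.shape)                      # (96, 7) (96, 7)
```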
5.2. Evaluation Metrics
The paper uses two common metrics for time series forecasting: Mean Squared Error (MSE) and Mean Absolute Error (MAE). Lower values for both metrics indicate better forecasting accuracy.
- Mean Squared Error (MSE)
  - Conceptual Definition: MSE measures the average of the squares of the errors. It quantifies the average magnitude of the errors, where larger errors contribute disproportionately more to the total due to the squaring, making it particularly sensitive to large errors or outliers.
  - Mathematical Formula: $ \mathrm{MSE} = \frac{1}{N \cdot S} \sum_{i=1}^{N} \sum_{t=1}^{S} (y_{i,t} - \hat{y}_{i,t})^2 $
  - Symbol Explanation:
    - $N$: The total number of variates (time series).
    - $S$: The prediction length (number of time steps to forecast).
    - $y_{i,t}$: The true (actual) value for variate $i$ at time step $t$.
    - $\hat{y}_{i,t}$: The predicted value for variate $i$ at time step $t$.
- Mean Absolute Error (MAE)
  - Conceptual Definition: MAE measures the average of the absolute differences between predictions and actual observations. It gives an idea of the average magnitude of the error without considering its direction. Unlike MSE, it is less sensitive to outliers, as it does not square the errors.
  - Mathematical Formula: $ \mathrm{MAE} = \frac{1}{N \cdot S} \sum_{i=1}^{N} \sum_{t=1}^{S} |y_{i,t} - \hat{y}_{i,t}| $
  - Symbol Explanation: the symbols are as defined for MSE, with $|\cdot|$ denoting the absolute value.

These metrics are effective for validating a method's performance by providing quantitative measures of prediction accuracy. MSE penalizes larger errors more, which can be important in applications where large deviations are critical, while MAE provides a more robust average error magnitude.
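Both metrics translate directly into code; a small sketch (with randomly generated stand-in arrays of shape N x S) is given below.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error averaged over all variates and predicted steps."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean Absolute Error averaged over all variates and predicted steps."""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.random.randn(7, 96)                    # N variates x S predicted steps
y_pred = y_true + 0.1 * np.random.randn(7, 96)     # a slightly perturbed "forecast"
print(round(mse(y_true, y_pred), 4), round(mae(y_true, y_pred), 4))
```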
5.3. Baselines
The paper compares iTransformer against 10 well-acknowledged forecasting models, representing different categories of deep forecasters for time series:
- Transformer-based methods:
  - Autoformer (Wu et al., 2021): Utilizes decomposition and auto-correlation mechanisms for long-term forecasting.
  - FEDformer (Zhou et al., 2022): Incorporates a frequency-enhanced decomposed Transformer for improved long-term forecasting.
  - Stationary (Liu et al., 2022b): Addresses non-stationarity in time series data using adaptive normalization techniques within a Transformer framework.
  - Crossformer (Zhang & Yan, 2023): Designed to explicitly capture cross-time and cross-variate dependencies.
  - PatchTST (Nie et al., 2023): Applies patching (dividing series into sub-series) to Transformers for long-term forecasting, treating patches as tokens.
- Linear-based methods: These represent simpler yet effective models that have recently challenged complex deep learning architectures.
  - DLinear (Zeng et al., 2023): A purely linear model that decomposes time series into trend and remainder components and forecasts them separately.
  - TiDE (Das et al., 2023): Time-series Dense Encoder, a simple yet powerful model primarily based on MLPs.
  - RLinear (Li et al., 2023): Revisits linear mapping for long-term time series forecasting, emphasizing simplicity and effectiveness.
- TCN-based methods: These models use convolutional neural networks specifically designed for sequential data.
  - SCINet (Liu et al., 2022a): Sample Convolution and Interaction Network, a TCN-based model using interactive downsampling and convolution for time series modeling.
  - TimesNet (Wu et al., 2023): Models temporal 2D-variations by transforming 1D time series into 2D tensors for generalized time series analysis.

These baselines are representative because they cover the spectrum of modern deep learning approaches to time series forecasting, including various Transformer adaptations, highly competitive linear models, and strong convolutional approaches. This diverse set allows for a comprehensive evaluation of iTransformer's performance across different architectural paradigms and inherent assumptions about time series structure.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that iTransformer consistently achieves state-of-the-art performance across various challenging real-world datasets, particularly excelling in forecasting high-dimensional time series. This validates the effectiveness of its inverted architecture.
The following are the results from Table 1 of the original paper:
| Models | iTransformer (Ours) | RLinear (2023) | PatchTST (2023) | Crossformer (2023) | TiDE (2023) | TimesNet (2023) | DLinear (2023) | SCINet (2022a) | FEDformer (2022) | Stationary (2022b) | Autoformer (2021) |
| Metric | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE |
| ECL | | 0.178 0.270 | | 0.205 0.290 | 0.244 0.334 | 0.251 0.340 | 0.219 0.298 | 0.240 0.330 | 0.250 0.339 | 0.257 0.331 | 0.290 0.380 | 0.254 0.361 | |||||||||
| ETT | | 0.383 0.407 | | 0.414 0.407 | 0.387 0.400 | 0.513 0.496 | 0.403 0.407 | 0.398 0.404 | 0.400 0.403 | 0.481 0.452 | 0.481 0.456 | 0.507 0.500 | 0.553 0.496 | |||||||||
| Exchange | | 0.360 0.403 | | 0.378 0.417 | 0.367 0.404 | 0.940 0.707 | 0.370 0.417 | 0.354 0.414 | 0.358 0.404 | 0.780 0.619 | 0.461 0.454 | 0.610 0.570 | 0.804 0.637 | |||||||||
| PEMS | | 0.113 0.221 | | 0.180 0.291 | 0.169 0.281 | 0.326 0.419 | 0.147 0.248 | 0.278 0.375 | 0.114 0.224 | 0.213 0.327 | 0.147 0.249 | 0.667 0.601 | | |||||||||
| Solar-Energy | | 0.233 0.262 | | 0.270 0.307 | 0.261 0.301 | 0.641 0.639 | 0.347 0.417 | 0.290 0.330 | 0.301 0.340 | 0.639 0.639 | 0.381 0.381 | 0.700 0.601 | 0.801 0.610 | |||||||||
| Traffic | | 0.428 0.282 | | 0.626 0.378 | 0.481 0.304 | 0.550 0.304 | 0.476 0.300 | 0.650 0.396 | 0.580 0.366 | 0.612 0.338 | 0.756 0.474 | 0.617 0.336 | 0.789 0.505 | |||||||||
| Weather | | 0.258 0.278 | | 0.272 0.291 | 0.259 0.281 | 0.259 0.315 | 0.260 0.280 | 0.260 0.292 | 0.259 0.280 | 0.363 0.330 | 0.360 0.336 | 0.292 0.280 | 0.360 0.330 |
(Note: The original Table 1 had some formatting issues, especially for the multi-column header and alignment. I have transcribed it as best as possible, focusing on data fidelity.)
Key observations from Table 1:
- Superior Performance: iTransformer achieves the best or second-best MSE/MAE scores (marked in red and underlined, respectively, in the original paper) across almost all datasets, covering ECL, ETT, Exchange, PEMS, Solar-Energy, Traffic, and Weather, demonstrating its broad applicability and effectiveness.
- High-Dimensional Series: The model performs particularly well on datasets with a large number of variates, such as ECL (321 variates), Traffic (862 variates), and the PEMS subsets (170 to 883 variates). This supports the claim that iTransformer is effective at modeling multivariate correlations by applying attention on the variate dimension.
- PatchTST Limitations: PatchTST, a previous state-of-the-art model, surprisingly fails in many cases on the PEMS dataset. The paper attributes this to the extremely fluctuating nature of PEMS series, suggesting that PatchTST's patching mechanism may lose focus on the specific local details necessary to handle rapid fluctuations. In contrast, iTransformer, which aggregates whole-series variations into series representations, copes better with such situations.
- Crossformer vs. iTransformer: Despite Crossformer explicitly capturing cross-time and cross-variate dependencies, its performance still falls short of iTransformer. This suggests that iTransformer's inverted architecture provides a more robust and less noisy way to handle multivariate interactions.

In essence, the results strongly validate that iTransformer's inverted architecture provides a robust solution for real-world multivariate time series forecasting, outperforming both complex Transformer variants and simpler linear models.
6.2. iTransformers Generality
The paper further explores the generality of the inverted framework by applying it to various Transformer variants, evaluating variate generalization, and assessing performance with increased lookback length.
6.2.1. Performance Promotion
The inverted framework consistently improves the performance of various Transformer models.
The following are the results from Table 2 of the original paper:
| Models | Transformer (2017) | Reformer (2020) | Informer (2021) | Flowformer (2022) | Flashformer (2022) | ||||||
| MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | ||
| ECL | Original | 0.277 | 0.372 | 0.338 | 0.422 | 0.311 | 0.397 | 0.267 | 0.359 | 0.285 | 0.377 |
| +Inverted | 0.178 | 0.270 | 0.208 | 0.301 | 0.216 | 0.311 | 0.210 | 0.293 | 0.206 | 0.291 | |
| Promotion | 35.6% | 27.4% | 38.4% | 28.7% | 30.5% | 21.6% | 21.3% | 18.6% | 27.8% | 22.9% | |
| Traffic | Original | 0.665 | 0.363 | 0.741 | 0.422 | 0.764 | 0.416 | 0.750 | 0.421 | 0.658 | 0.356 |
| +Inverted | 0.428 | 0.282 | 0.647 | 0.370 | 0.662 | 0.380 | 0.524 | 0.355 | 0.492 | 0.333 | |
| Promotion | 35.6% | 22.3% | 12.7% | 12.3% | 13.3% | 8.6% | 30.1% | 15.6% | 25.2% | 6.4% | |
| Weather | Original | 0.657 | 0.572 | 0.803 | 0.656 | 0.634 | 0.548 | 0.286 | 0.308 | 0.659 | 0.574 |
| +Inverted | 0.258 | 0.279 | 0.248 | 0.292 | 0.271 | 0.330 | 0.266 | 0.285 | 0.262 | 0.282 | |
| Promotion | 60.2% | 50.8% | 69.2% | 55.5% | 57.3% | 39.8% | 7.2% | 7.7% | 60.2% | 50.8% | |
- The inverted framework leads to significant MSE reductions (the Promotion rows) across all tested Transformer variants: 38.9% on Transformer, 36.1% on Reformer, 28.5% on Informer, 16.8% on Flowformer, and 32.2% on Flashformer. This robust improvement highlights that the issue was not the Transformer components themselves, but their conventional application to time series.
- The framework allows for the integration of efficient attention mechanisms (like Flowformer or FlashAttention) to address computational complexity when the number of variates is large, making it practical for real-world high-dimensional datasets.
6.2.2. Variate Generalization
iTransformer demonstrates strong generalization capabilities on unseen variates.
The following figure (Figure 5 from the original paper) shows the performance of generalization on unseen variates:
The chart (Figure 5) shows the generalization of the inverted models to unseen variates on ECL, Traffic, and Solar-Energy, comparing Transformer, Informer, Reformer, and Flowformer variants trained on all variates (100%) versus a subset (20%) in terms of MSE.
- Flexibility: By design, iTransformer's variate tokens allow a flexible number of inputs, meaning it can be trained on a subset of variates and then forecast all variates (including unseen ones) without fine-tuning.
- Transferable Representations: The shared feed-forward networks learn intrinsic patterns of time series that are transferable across different variates.
- Comparison with CI-Transformers: When comparing iTransformers with Channel Independence (CI) Transformers (where a shared backbone forecasts each variate independently), iTransformers are more efficient: CI-Transformers often have to predict each variate one by one during inference, which is time-consuming, whereas iTransformers predict all variates simultaneously and show smaller error increases when generalizing from partial training. This suggests a path toward building foundation models for time series based on iTransformer. A small shape check illustrating this flexibility follows this list.
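The shape check below illustrates why this flexibility holds: in an inverted block, no learned weight has a size tied to the number of variates N, so the same parameters accept 20 or 100 variates. It is a stripped-down sketch with illustrative dimensions, not the paper's model.

```python
import torch
import torch.nn as nn

T, S, D = 96, 96, 128
embed, project = nn.Linear(T, D), nn.Linear(D, S)       # act on T and D, never on N
attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

def forecast(x):                                         # x: (B, T, N) for any N
    h = embed(x.transpose(1, 2))                         # (B, N, D): one token per variate
    h = h + attn(h, h, h)[0]                             # attention over the N variate tokens
    return project(h).transpose(1, 2)                    # (B, S, N)

print(forecast(torch.randn(2, T, 20)).shape)             # torch.Size([2, 96, 20])
print(forecast(torch.randn(2, T, 100)).shape)            # torch.Size([2, 96, 100])
```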
6.2.3. Increasing Lookback Length
The inverted framework enables Transformer models to effectively utilize enlarged lookback windows for better predictions, a challenge for conventional Transformer adaptations.
The following figure (Figure 6 from the original paper) shows forecasting performance with increased lookback length:
The chart (Figure 6) plots forecasting performance (MSE) on the ECL and Traffic datasets as the lookback length grows, showing that the inverted models clearly outperform the other Transformers and their variants when the lookback window is enlarged.
- Vanilla Transformer-based forecasters often do not benefit from an increased lookback length T, and sometimes even degrade in performance, possibly due to distracted attention.
- iTransformer and its inverted variants, however, show improved performance as the lookback length increases. This indicates that the strategic assignment of the MLPs to the temporal dimension allows the model to leverage more historical information, a characteristic traditionally seen in strong statistical and linear methods.
6.3. Model Analysis
6.3.1. Ablation Study
Ablation studies confirm the rational assignment of Transformer components in iTransformer.
The following are the results from Table 3 of the original paper:
| Design | Variate | Temporal | ECL | | Traffic | | Weather | | Solar-Energy | |
| | | | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE |
| iTransformer | Attention | FFN | 0.178 | 0.270 | 0.428 | 0.282 | 0.258 | 0.278 | 0.233 | 0.262 |
| Replace | Attention | Attention | 0.193 | 0.293 | 0.913 | 0.500 | 0.255 | 0.280 | 0.261 | 0.291 |
| | FFN | Attention | 0.202 | 0.300 | 0.863 | 0.499 | 0.258 | 0.283 | 0.285 | 0.317 |
| | FFN | FFN | 0.182 | 0.287 | 0.599 | 0.348 | 0.248 | 0.274 | 0.269 | 0.287 |
| w/o | w/o | FFN | 0.189 | 0.278 | 0.456 | 0.306 | 0.261 | 0.281 | 0.258 | 0.289 |
| | Attention | w/o | 0.193 | 0.276 | 0.461 | 0.294 | 0.265 | 0.283 | 0.261 | 0.283 |
(Note: The original Table 3 had some formatting issues, especially for the multi-column header and alignment. I have transcribed it as best as possible, focusing on data fidelity.)
The full results from Appendix B (Table 6) further elaborate on these findings.
The following are the results from Table 6 of the original paper:
| Design | Variate | Temporal | Prediction Length | ECL (MSE MAE) | Traffic (MSE MAE) | Weather (MSE MAE) | Solar-Energy (MSE MAE) |
| iTransformer | Attention | FFN | 96 | 0.148 0.240 | 0.395 0.268 | 0.174 0.214 | 0.203 0.237 |
| | | | Avg | 0.178 0.270 | 0.428 0.282 | 0.258 0.279 | 0.233 0.262 |
| Replace | Attention | Attention | 96 | 0.161 0.263 | 1.021 0.581 | 0.168 0.213 | 0.227 0.270 |
| | | | Avg | 0.193 0.293 | 0.913 0.500 | 0.255 0.280 | 0.261 0.291 |
| | FFN | Attention | 96 | 0.169 0.270 | 0.907 0.540 | 0.176 0.221 | 0.247 0.299 |
| | | | Avg | 0.202 0.300 | 0.863 0.499 | 0.258 0.283 | 0.285 0.317 |
| | FFN | FFN | 96 | 0.159 0.261 | 0.606 0.342 | 0.162 0.207 | 0.237 0.277 |
| | | | Avg | 0.182 0.287 | 0.599 0.348 | 0.248 0.274 | 0.269 0.287 |
| w/o | w/o | FFN | 96 | 0.163 0.254 | 0.427 0.296 | 0.177 0.219 | 0.226 0.266 |
| | | | Avg | 0.189 0.278 | 0.456 0.306 | 0.261 0.281 | 0.258 0.289 |
| w/o | Attention | w/o | 96 | 0.169 0.253 | 0.437 0.283 | 0.183 0.220 | 0.228 0.263 |
| | | | Avg | 0.193 0.276 | 0.461 0.294 | 0.265 0.283 | 0.261 0.283 |
- iTransformer (Attention on the variate dimension, FFN on the temporal dimension) consistently yields the best performance.
vanilla Transformerdesign (Attention on Temporal, FFN on Temporal, which is equivalent to 'Replace: Attention on Variate, Attention on Temporal' in the table if you map the dimensions, or if interpreted as applying attention on temporal tokens, which leads toAttention-Attentiondesign) performs the worst on Traffic, highlighting the risks of conventional architectures for time series. - Replacing
AttentionwithFFNon thevariate dimension(Replace: FFN on Variate, Attention on Temporal) leads to worse performance thaniTransformer, especially on Traffic. This emphasizes the critical role ofattentionin capturingmultivariate correlations. - Replacing
FFNwithAttentionon thetemporal dimension(Replace: Attention on Variate, Attention on Temporal) also shows degraded performance, suggestingFFNis better suited fortemporal series representationlearning. - Removing
Attention(w/o Attention) orFFN(w/o FFN) both degrade performance, confirming that both components are essential and their specific roles iniTransformerare optimal.
6.3.2. Analysis of Series Representations
The paper uses Centered Kernel Alignment (CKA) similarity to analyze the learned series representations. CKA measures the similarity between representations of different layers in a neural network. Higher CKA similarity between initial and final layer representations has been correlated with better performance in generative tasks like time series forecasting.
The following figure (Figure 7 from the original paper) shows an analysis of series representations and multivariate correlations:
Figure 7: the left panel compares MSE against CKA similarity across models, illustrating iTransformer's advantage in representation learning; the right panel visualizes the raw multivariate correlations alongside the score maps learned by the inverted self-attention.
- As shown in Figure 7 (left), iTransformers exhibit higher CKA similarity than vanilla Transformers. This indicates that iTransformers learn more appropriate and stable series representations, which contributes to their improved forecasting accuracy and supports the idea that the inverted architecture helps maintain and refine meaningful representations throughout the network.
6.3.3. Analysis of Multivariate Correlations
By assigning the duty of multivariate correlation to the attention mechanism on variate tokens, iTransformer gains enhanced interpretability.
The following figure (Figure 11 from the original paper) shows multivariate correlations of the lookback series and future series and the learned score maps by inverted self-attention of different layers:
The figure shows, for Cases 1-3 on the Solar-Energy dataset, the multivariate correlations of the lookback and future series together with the inverted self-attention score maps learned at different layers.
- Interpretability: Figure 7 (right) and Figure 11 show visualizations of attention score maps for Solar-Energy data. In shallow layers, the learned attention map closely resembles the actual Pearson correlation of the raw lookback series. As the model delves into deeper layers, the attention map progressively aligns with the correlations of the future series to be predicted.
- Encoding-Decoding Process: This phenomenon suggests that iTransformer effectively uses attention to learn and evolve multivariate correlations from the past to predict the future. The encoding of past patterns and the decoding for future predictions are implicitly conducted through the series representations during feed-forwarding and layer stacking.

The following figure (Figure 12 from the original paper) shows a visualization of the variates from the Market dataset and the learned multivariate correlations:
The figure shows, for two cases from the Market dataset, the time series curves of several variates (left) and heatmaps of the learned multivariate correlations (right); specific variate pairs attended to by the attention mechanism are marked.
- Figure 12, using the Market dataset, further demonstrates this. The learned multivariate correlation map clearly shows partitioning corresponding to application categories, and the marked correlations reflect high similarity between variates from the same application. This confirms that iTransformer's attention mechanism can identify and leverage meaningful correlations between variates for forecasting.
6.3.4. Efficient Training Strategy
To address the quadratic complexity of self-attention with numerous variates, iTransformer proposes an efficient training strategy.
The following figure (Figure 8 from the original paper) shows an analysis of the efficient training strategy:
The figure (Figure 8) shows, for different sampling ratios of partially trained variates, the forecasting performance (left, MSE) and the memory footprint (right, GB). Performance remains stable while memory usage drops significantly; the inset in the right panel zooms in on the memory trend for the Traffic dataset.
- Partial Variate Training: The strategy randomly chooses only a subset of the variates in each batch for training. Thanks to the flexibility of variate tokenization, the model can still predict all variates during inference (see the sketch after this list).
- Performance vs. Memory: Figure 8 illustrates that this strategy maintains comparable performance (left chart) while significantly reducing the memory footprint (right chart). This makes iTransformer more scalable for the extremely high-dimensional datasets common in real-world applications.
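A minimal sketch of the sampling step is shown below; the 25% ratio and the tensor shapes are illustrative assumptions, not values prescribed by the paper.

```python
import torch

def sample_variates(batch: torch.Tensor, ratio: float = 0.25) -> torch.Tensor:
    """Randomly keep a subset of variates for one training step.

    A minimal sketch of the partial-variate strategy; `batch` is [B, T, N] and the
    sampling ratio is a hypothetical hyperparameter.
    """
    n_vars = batch.shape[-1]
    keep = max(1, int(n_vars * ratio))
    idx = torch.randperm(n_vars)[:keep]          # fresh random subset each batch
    return batch[..., idx]

# Training sees only `keep` variate tokens (cheaper attention); at inference the same
# weights apply to all N variates, since tokenization is per-variate.
x = torch.randn(8, 96, 862)                      # e.g. a Traffic-sized batch
print(sample_variates(x).shape)                  # torch.Size([8, 96, 215])
```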
6.4. Model Efficiency
The paper provides a comprehensive comparison of iTransformer's efficiency against other models.
The following figure (Figure 10 from the original paper) shows model efficiency comparison:
The figure compares the efficiency of multiple models on the input-96-predict-96 task for Weather (21 variates) and Traffic (862 variates), covering training speed, MSE, and GPU memory usage; iTransformer performs well across these dimensions.
- On datasets with a relatively small number of variates (e.g., Weather with 21 variates), iTransformer is more efficient than other Transformers.
- On datasets with numerous variates (e.g., Traffic with 862 variates), iTransformer's memory footprint is similar to that of other Transformers, yet it can still be trained faster, even though the complexity of the attention module is O(N^2) in the number of tokens N: for vanilla Transformers, N equals the lookback length (e.g., 96), whereas for iTransformer, N is the number of variates (e.g., 862 for Traffic), which is much larger (a back-of-the-envelope comparison follows this list).
- Using linear-complexity attention mechanisms (like those in Flowformer) or the proposed efficient training strategy (partial variate training) can further bring iTransformer's speed and memory footprint in line with linear models, while still offering superior performance on many multivariate tasks.
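The following snippet makes the token-count argument concrete by comparing the rough O(N^2 · d) cost of one attention layer's score matrix under temporal versus variate tokenization; the numbers are illustrative only.

```python
def attention_cost(n_tokens: int, d_model: int = 128) -> int:
    """Rough score-matrix cost of one self-attention layer: O(N^2 * d) multiply-adds."""
    return n_tokens * n_tokens * d_model

# Vanilla Transformer tokenizes timestamps (N = lookback T); iTransformer tokenizes
# variates (N = number of variates). Illustrative numbers only.
print(attention_cost(96))        # temporal tokens, lookback 96
print(attention_cost(21))        # variate tokens, Weather (21 variates)
print(attention_cost(862))       # variate tokens, Traffic (862 variates)
```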
6.5. Showcases
Visual prediction showcases further highlight iTransformer's superior performance.
The following figure (Figure 13 from the original paper) shows a visualization of input-96-predict-96 results on the Traffic dataset:
The figure compares input-96-predict-96 prediction curves of iTransformer, PatchTST, DLinear, Crossformer, Autoformer, and Transformer on the Traffic dataset.
The following figure (Figure 14 from the original paper) shows a visualization of input-96-predict-96 results on the ECL dataset:
The figure compares the input-96-predict-96 predictions of the six models (iTransformer, PatchTST, DLinear, Crossformer, Autoformer, Transformer) against the ground truth on the ECL dataset, reflecting each model's accuracy.
The following figure (Figure 15 from the original paper) shows a visualization of input-96-predict-96 results on the Weather dataset:
The figure shows how the predicted values of the six models (iTransformer, PatchTST, DLinear, Crossformer, Autoformer, Transformer) match the ground truth on the Weather dataset.
The following figure (Figure 16 from the original paper) shows a visualization of input-96-predict-96 results on the PEMS dataset:
The figure compares the prediction curves of iTransformer and the five other models (PatchTST, DLinear, Crossformer, Autoformer, Transformer) with the ground truth for the input-96-predict-96 task on the PEMS dataset.
- Figures 13-16 present visual comparisons of predictions from iTransformer and several baseline models (PatchTST, DLinear, Crossformer, Autoformer, Transformer) on the Traffic, ECL, Weather, and PEMS datasets.
- These visualizations consistently show that iTransformer produces the most precise future series variations, closely tracking the ground truth, compared to other models. This qualitative evidence reinforces the quantitative SOTA results.
6.6. Risks of Embedding Multivariate Points of A Timestamp
The paper provides a qualitative analysis of why the traditional temporal token embedding approach is problematic, especially with real-world complexities.
The following figure (Figure 17 from the original paper) shows a visualization of partial variates of Traffic:
The figure plots road occupancy over time for a subset of sensors in the Traffic dataset. Some series (e.g., Sensor 2 and Sensor 4) are strongly synchronized, while others show clear time delays or leads (e.g., Sensor 1 vs. Sensor 2, Sensor 859 vs. Sensor 861).
- Figure 17, drawn from the Traffic dataset, illustrates scenarios where multivariate time series exhibit strong correlations together with obvious phase offsets or systematic time lags (e.g., Sensor 1 and Sensor 2).
- In such cases, embedding simultaneous time points into a single temporal token (as done in vanilla Transformers) can lead to meaningless attention maps, because these points do not represent the same event at the same time; this "interaction noise" degrades performance. iTransformer mitigates the issue by embedding the whole series of each variate as a token, which makes it more robust to delayed events, inconsistent measurements, irregularly spaced time series, and systematic delays across variates (a toy illustration of the lag effect follows this list).
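The toy example below illustrates the lag problem with synthetic data: two sensors measuring the same signal with a systematic delay have low same-timestamp correlation, while the lag-aligned correlation is high. The signal shape, noise level, and lag are arbitrary choices for illustration.

```python
import numpy as np

# Two synthetic sensors observing the same signal with a systematic 12-step delay.
t = np.arange(500)
signal = np.sin(2 * np.pi * t / 50)
sensor_a = signal + 0.1 * np.random.randn(t.size)
sensor_b = np.roll(signal, 12) + 0.1 * np.random.randn(t.size)      # delayed copy

same_time = np.corrcoef(sensor_a, sensor_b)[0, 1]                    # what a temporal token "sees"
lag_aligned = np.corrcoef(sensor_a[:-12], sensor_b[12:])[0, 1]       # the true relationship

print(f"same-timestamp corr: {same_time:.2f}, lag-aligned corr: {lag_aligned:.2f}")
```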
6.7. Data Presentation (Tables)
6.7.1. Full Promotion Results (Table 7)
The following are the results from Table 7 of the original paper:
| Datasets | ETT | | ECL | | PEMS | | Solar-Energy | | Traffic | | Weather | |
| Metric | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE |
| Transformer | 2.750 | 1.375 | 0.277 | 0.372 | 0.157 | 0.263 | 0.256 | 0.276 | 0.665 | 0.363 | 0.657 | 0.572 |
| iTransformer | 0.383 | 0.407 | 0.178 | 0.270 | 0.113 | 0.221 | 0.233 | 0.262 | 0.428 | 0.282 | 0.258 | 0.279 |
| Promotion | 86.1% | 70.4% | 35.6% | 27.4% | 28.0% | 16.0% | 9.0% | 5.1% | 35.6% | 22.3% | 60.2% | 50.8% |
6.7.2. Full Framework Generality Results (Table 8)
The following are the results from Table 8 of the original paper:
| Dataset | Variant | Horizon | Transformer (2017) | | Reformer (2020) | | Informer (2021) | | Flowformer (2022) | | Flashformer (2022) | |
| | | | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE |
| ECL | Original | 96 | 0.260 | 0.358 | 0.312 | 0.402 | 0.274 | 0.368 | 0.215 | 0.320 | 0.259 | 0.357 |
| 192 | 0.266 | 0.367 | 0.348 | 0.433 | 0.296 | 0.386 | 0.259 | 0.355 | 0.274 | 0.374 | ||
| Avg | 0.277 | 0.372 | 0.338 | 0.422 | 0.311 | 0.397 | 0.267 | 0.359 | 0.285 | 0.377 | ||
| +Inverted | 96 | 0.148 | 0.240 | 0.182 | 0.275 | 0.190 | 0.286 | 0.183 | 0.267 | 0.178 | 0.265 | |
| Avg | 0.178 | 0.270 | 0.208 | 0.301 | 0.216 | 0.311 | 0.210 | 0.293 | 0.206 | 0.291 | ||
| Traffic | Original | 96 | 0.647 | 0.357 | 0.732 | 0.423 | 0.719 | 0.367 | 0.493 | 0.339 | 0.464 | 0.320 |
| 192 | 0.649 | 0.356 | 0.733 | 0.420 | 0.696 | 0.391 | 0.691 | 0.393 | 0.641 | 0.348 | ||
| Avg | 0.665 | 0.363 | 0.741 | 0.422 | 0.764 | 0.416 | 0.750 | 0.421 | 0.658 | 0.356 | ||
| +Inverted | 96 | 0.395 | 0.268 | 0.617 | 0.356 | 0.632 | 0.367 | 0.493 | 0.339 | 0.464 | 0.320 | |
| Avg | 0.428 | 0.282 | 0.647 | 0.370 | 0.662 | 0.380 | 0.524 | 0.355 | 0.492 | 0.333 | ||
| Weather | Original | 96 | 0.395 | 0.427 | 0.689 | 0.596 | 0.300 | 0.384 | 0.182 | 0.233 | 0.388 | 0.425 |
| 192 | 0.619 | 0.560 | 0.752 | 0.638 | 0.598 | 0.544 | 0.250 | 0.288 | 0.619 | 0.560 | ||
| Avg | 0.657 | 0.572 | 0.803 | 0.656 | 0.634 | 0.548 | 0.286 | 0.308 | 0.659 | 0.574 | ||
| +Inverted | 96 | 0.174 | 0.214 | 0.169 | 0.225 | 0.180 | 0.251 | 0.183 | 0.223 | 0.177 | 0.218 | |
| Avg | 0.258 | 0.279 | 0.248 | 0.292 | 0.271 | 0.330 | 0.266 | 0.285 | 0.262 | 0.282 | ||
6.7.3. Full Results of Variate Generalization (Table 18)
The following are the results from Table 18 of the original paper:
| Backbone | Model | ECL | | Traffic | | Solar-Energy | | PEMS03 | | PEMS04 | | PEMS08 | |
| MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | MSE | MAE | ||
| Transformer | CI-Transformer | 0.366 | 0.435 | 0.720 | 0.428 | 0.301 | 0.336 | 0.170 | 0.279 | 0.164 | 0.278 | 0.168 | 0.277 |
| 1.238 | 0.825 | 1.082 | 0.771 | 0.490 | 0.485 | 0.260 | 0.363 | 0.260 | 0.360 | 0.275 | 0.380 | ||
| iTransformer | 0.183 | 0.271 | 0.432 | 0.284 | 0.242 | 0.270 | 0.113 | 0.221 | 0.111 | 0.221 | 0.101 | 0.204 | |
| 0.225 | 0.317 | 0.467 | 0.302 | 0.249 | 0.275 | 0.164 | 0.275 | 0.150 | 0.262 | 0.139 | 0.245 | ||
| Informer | CI-Informer | 0.354 | 0.430 | 0.767 | 0.431 | 0.309 | 0.342 | 0.178 | 0.288 | 0.175 | 0.287 | 0.172 | 0.282 |
| 1.282 | 0.835 | 1.077 | 0.749 | 0.485 | 0.482 | 0.270 | 0.366 | 0.268 | 0.363 | 0.280 | 0.378 | ||
| iTransformer | 0.220 | 0.313 | 0.658 | 0.370 | 0.297 | 0.334 | 0.150 | 0.262 | 0.147 | 0.259 | 0.139 | 0.247 | |
| 0.216 | 0.311 | 0.662 | 0.380 | 0.311 | 0.347 | 0.169 | 0.281 | 0.169 | 0.281 | 0.165 | 0.276 | ||
| Reformer | CI-Reformer | 0.351 | 0.428 | 0.749 | 0.426 | 0.312 | 0.344 | 0.178 | 0.288 | 0.175 | 0.286 | 0.173 | 0.283 |
| 1.258 | 0.834 | 1.073 | 0.748 | 0.490 | 0.485 | 0.268 | 0.366 | 0.265 | 0.362 | 0.280 | 0.378 | ||
| iTransformer | 0.208 | 0.301 | 0.647 | 0.370 | 0.248 | 0.292 | 0.180 | 0.291 | 0.175 | 0.286 | 0.173 | 0.283 | |
| 0.208 | 0.301 | 0.647 | 0.370 | 0.292 | 0.339 | 0.180 | 0.291 | 0.180 | 0.291 | 0.180 | 0.291 | ||
| Flowformer | CI-Flowformer | 0.288 | 0.373 | 0.709 | 0.419 | 0.275 | 0.315 | 0.162 | 0.273 | 0.158 | 0.270 | 0.159 | 0.267 |
| 1.082 | 0.758 | 1.020 | 0.718 | 0.432 | 0.443 | 0.252 | 0.354 | 0.249 | 0.350 | 0.260 | 0.360 | ||
| iTransformer | 0.210 | 0.293 | 0.524 | 0.355 | 0.266 | 0.285 | 0.147 | 0.248 | 0.142 | 0.240 | 0.129 | 0.241 | |
| 0.210 | 0.293 | 0.524 | 0.355 | 0.285 | 0.315 | 0.147 | 0.248 | 0.142 | 0.240 | 0.129 | 0.241 | ||
6.7.4. Full Forecasting Results - PEMS (Table 9)
The following are the results from Table 9 of the original paper:
| Models | iTransformer (Ours) | RLinear (2023) | PatchTST (2023) | Crossformer (2023) | TiDE (2023) | TimesNet (2023) | DLinear (2023) | SCINet (2022a) | FEDformer (2022) | Stationary (2022b) | Autoformer (2021) | ||||||||||||||||||
| Metric | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE |
| PEMS03 | | 0.113 0.221 | | 0.180 0.291 | 0.169 0.281 | 0.326 0.419 | 0.147 0.248 | 0.278 0.375 | 0.114 0.224 | 0.213 0.327 | 0.147 0.249 | 0.667 0.601 | 0.742 0.627 | ||||||||||||||||||
| PEMS04 | | 0.111 0.221 | | 0.195 0.307 | 0.209 0.314 | 0.353 0.437 | 0.129 0.241 | 0.295 0.388 | 0.092 0.202 | 0.231 0.337 | 0.127 0.240 | 0.610 0.590 | 0.610 0.589 | ||||||||||||||||||
| PEMS07 | | 0.101 0.204 | | 0.211 0.303 | 0.235 0.315 | 0.441 0.464 | 0.158 0.244 | 0.280 0.358 | 0.100 0.210 | 0.201 0.296 | 0.119 0.225 | 0.504 0.478 | 0.817 0.645 | ||||||||||||||||||
| PEMS08 | | 0.150 0.226 | | 0.260 0.379 | 0.307 0.390 | 0.487 0.509 | 0.193 0.270 | 0.379 0.416 | 0.158 0.244 | 0.280 0.358 | 0.201 0.276 | 0.817 0.645 | 0.817 0.645 | ||||||||||||||||||
| 1st Count | | 13 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | |
6.7.5. Full Forecasting Results - Long-term (Table 10)
The following are the results from Table 10 of the original paper:
| Models | iTransformer (Ours) | RLinear (2023) | PatchTST (2023) | Crossformer (2023) | TiDE (2023) | TimesNet (2023) | DLinear (2023) | SCINet (2022a) | FEDformer (2022) | Stationary (2022b) | Autoformer (2021) | ||||||||||
| Metric | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE |
| ETTm1 | | 0.288 0.332 | | 0.286 0.327 | 0.281 0.326 | 0.757 0.610 | 0.358 0.404 | 0.291 0.333 | 0.350 0.400 | 0.571 0.537 | 0.305 0.349 | 0.306 0.347 | 0.327 0.370 | ||||||||||
| ETTh2 | | 0.383 0.407 | | 0.374 0.398 | 0.387 0.407 | 0.942 0.684 | 0.611 0.550 | 0.414 0.427 | 0.559 0.510 | 0.954 0.723 | 0.437 0.449 | 0.526 0.516 | 0.450 0.450 | ||||||||||
| ECL | | 0.178 0.270 | | 0.219 0.298 | 0.205 0.290 | 0.244 0.334 | 0.251 0.340 | 0.219 0.298 | 0.240 0.330 | 0.250 0.339 | 0.290 0.380 | 0.254 0.361 | 0.222 0.321 | ||||||||||
| Exchange | | 0.360 0.403 | | 0.378 0.417 | 0.367 0.404 | 0.940 0.707 | 0.370 0.417 | 0.354 0.414 | 0.358 0.404 | 0.780 0.619 | 0.461 0.454 | 0.610 0.570 | 0.804 0.637 | ||||||||||
| Traffic | | 0.428 0.282 | | 0.626 0.378 | 0.481 0.304 | 0.550 0.304 | 0.476 0.300 | 0.650 0.396 | 0.580 0.366 | 0.612 0.338 | 0.756 0.474 | 0.617 0.336 | 0.789 0.505 | ||||||||||
| Weather | | 0.258 0.278 | | 0.272 0.291 | 0.259 0.281 | 0.259 0.315 | 0.260 0.280 | 0.260 0.292 | 0.259 0.280 | 0.363 0.330 | 0.360 0.336 | 0.292 0.280 | 0.360 0.330 | ||||||||||
| Solar-Energy | | 0.233 0.262 | | 0.270 0.307 | 0.261 0.301 | 0.641 0.639 | 0.347 0.417 | 0.290 0.330 | 0.301 0.340 | 0.639 0.639 | 0.381 0.381 | 0.700 0.601 | 0.801 0.610 | ||||||||||
| 1st Count | | 16 | 22 | 6 | 12 | 12 | 11 | 3 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | |
6.7.6. Full Forecasting Results - Market (Table 11)
The following are the results from Table 11 of the original paper:
| Models | iTransformer (Ours) | RLinear (2023) | PatchTST (2023) | Crossformer (2023) | TiDE (2023) | TimesNet (2023) | DLinear (2023) | SCINet (2022a) | FEDformer (2022) | Stationary (2022b) | Autoformer (2021) | ||||||||||
| Metric | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE | MSE MAE |
| Market-Merchant | | 0.072 0.147 | | 0.152 0.247 | 0.084 0.171 | 0.117 0.181 | 0.187 0.289 | 0.093 0.184 | 0.110 0.206 | 0.316 0.385 | 0.296 0.401 | 0.172 0.277 | 0.494 0.531 | ||||||||||
| Market-Wealth | | 0.345 0.289 | | 0.585 0.461 | 0.394 0.260 | 0.429 0.288 | 0.595 0.481 | 0.360 0.318 | 0.501 0.412 | 0.660 0.514 | 0.625 0.543 | 0.499 0.415 | 0.772 0.612 | ||||||||||
| Market-Finance | | 0.184 0.216 | | 0.395 0.336 | 0.210 0.248 | 5.333 0.618 | 0.987 0.442 | 0.516 0.308 | 0.765 0.372 | 2.817 0.734 | 1.621 0.569 | 1.368 0.643 | 1.872 0.681 | ||||||||||
| Market-Terminal | | 0.065 0.150 | | 0.180 0.286 | 0.077 0.179 | 0.071 0.162 | 0.216 0.311 | 0.080 0.179 | 0.106 0.210 | 0.280 0.360 | 0.295 0.403 | 0.180 0.296 | 0.518 0.547 | ||||||||||
| Market-Payment | | 0.072 0.144 | | 0.143 0.245 | 0.084 0.174 | 0.207 0.179 | 0.208 0.278 | 0.105 0.182 | 0.116 0.200 | 0.288 0.322 | 0.300 0.330 | 0.166 0.271 | 0.417 0.460 | ||||||||||
| Market-Customer | | 0.094 0.150 | | 0.214 0.261 | 0.118 0.180 | 0.309 0.194 | 0.308 0.307 | 0.142 0.191 | 0.184 0.219 | 0.461 0.385 | 0.350 0.391 | 0.242 0.301 | 0.669 0.593 | ||||||||||
| 1st Count | 28 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully introduces iTransformer, a novel framework for multivariate time series forecasting that re-conceptualizes the application of Transformer components without modifying their inherent design. By inverting the dimensions on which attention and feed-forward networks operate, iTransformer treats individual time series (variates) as tokens. This allows the self-attention mechanism to effectively capture multivariate correlations between different series, while the feed-forward network processes each variate token to learn robust series-global representations. The model achieves state-of-the-art performance across a diverse range of real-world datasets, demonstrating significant improvements over existing Transformer-based and linear forecasting models. Furthermore, iTransformer exhibits enhanced generalization ability to unseen variates and more effectively utilizes arbitrary lookback windows, addressing key limitations of prior Transformer applications in time series.
7.2. Limitations & Future Work
The authors acknowledge that while iTransformer provides a robust backbone, there is still room for improvement and exploration:
- Large-scale Pre-training: The promising generalization abilities of iTransformer suggest its potential for large-scale pre-training on diverse multivariate time series datasets. This could lead to powerful foundation models for time series, similar to those in NLP and computer vision.
- More Time Series Analysis Tasks: Future work can explore adapting iTransformer to time series analysis tasks beyond forecasting, such as anomaly detection, imputation, or classification.
- Enhanced Inductive Bias in Embedding: The current embedding layer is a simple MLP. The authors suggest that enhancing this embedding with more inductive bias (e.g., using TCN components) could make the model even more robust to various real-world scenarios (like irregular time series) while maintaining the flexibility of the Transformer architecture.
- Specialized Attention and FFN: Although iTransformer uses native components, the authors hint that elaborately designed components (e.g., more efficient attention for multivariate correlation, temporal dependency modeling under distribution shifts, fine-grained variate tokenization) could be developed on top of the inverted architecture to further boost performance.
7.3. Personal Insights & Critique
iTransformer offers a refreshingly simple yet profoundly impactful insight: the power of Transformers doesn't necessarily lie in complex modifications of its core mechanisms, but rather in a judicious re-evaluation of how inputs are tokenized and on which dimensions its components operate. This inverted perspective is a paradigm shift that could fundamentally change how Transformers are applied to multivariate time series.
Inspirations and Applications:
- Foundation Models for Time Series: The demonstrated variate generalization is particularly exciting. It strongly suggests that iTransformer could be a cornerstone for developing pre-trained foundation models for time series. A single model, trained on a vast and diverse collection of multivariate time series (even with varying numbers of variates), could then be fine-tuned or directly applied to new, unseen series, much like BERT or GPT for text. This would be a significant leap for a field currently hampered by the need for task-specific models.
- Robustness to Real-world Data: The paper effectively addresses common real-world complexities like time lags, inconsistent measurements, and non-stationarity. By normalizing each variate token and using attention for multivariate correlations, iTransformer provides a more robust solution than models that assume perfect synchronization or homogeneity between variates.
- Unifying Linear and Transformer Strengths: The design cleverly integrates the observed strengths of linear models (effective temporal modeling) into the FFN applied to variate tokens, while retaining the attention mechanism for multivariate interaction. This hybrid approach potentially offers the best of both worlds.
Potential Issues or Areas for Improvement:
- Computational Cost for Very High Variate Counts: While the paper suggests efficient attention mechanisms or partial variate training to mitigate the O(N^2) cost of attention over a large number of variates N, this quadratic dependency on N remains a theoretical bottleneck for truly massive multivariate datasets if dense attention is used. Exploring sparse attention strategies intrinsically designed for variate-wise interaction could be beneficial.
- Implicit Temporal Ordering: The paper states that positional encoding is not needed because temporal order is implicitly stored within the variate token (an embedding of the whole series). While this may hold for the current MLP embedding, explicit temporal position could still be valuable for highly temporal-sensitive tasks or for longer lookback windows, where fine-grained temporal information may be lost in a compressed variate token. Temporal encoding within the variate token is therefore a worthwhile direction for further study.
- Understanding Variate Token Semantics: While the attention maps are shown to be interpretable with respect to multivariate correlations, a deeper dive into the exact semantic content learned by the MLP embedding for each variate token could provide more insight. Which intrinsic properties (amplitude, periodicity, frequency spectrum) are truly being captured, and can they be explicitly controlled or influenced?

Overall, iTransformer represents a strong argument for re-examining fundamental architectural choices when applying powerful deep learning models to specific data modalities. Its success on diverse benchmarks and its potential for fostering foundation models make it a highly significant contribution to time series forecasting.