
iTransformer: Inverted Transformers Are Effective for Time Series Forecasting

Published: 10/10/2023

TL;DR Summary

iTransformer inverts the Transformer's dimensions to embed each variate's time points into a variate token, enhancing multivariate correlation modeling and nonlinear representation, addressing large lookback window challenges, and achieving state-of-the-art time series forecasting performance and generalization.

Abstract

The recent boom of linear forecasting models questions the ongoing passion for architectural modifications of Transformer-based forecasters. These forecasters leverage Transformers to model the global dependencies over temporal tokens of time series, with each token formed by multiple variates of the same timestamp. However, Transformers are challenged in forecasting series with larger lookback windows due to performance degradation and computation explosion. Besides, the embedding for each temporal token fuses multiple variates that represent potential delayed events and distinct physical measurements, which may fail in learning variate-centric representations and result in meaningless attention maps. In this work, we reflect on the competent duties of Transformer components and repurpose the Transformer architecture without any modification to the basic components. We propose iTransformer that simply applies the attention and feed-forward network on the inverted dimensions. Specifically, the time points of individual series are embedded into variate tokens which are utilized by the attention mechanism to capture multivariate correlations; meanwhile, the feed-forward network is applied for each variate token to learn nonlinear representations. The iTransformer model achieves state-of-the-art on challenging real-world datasets, which further empowers the Transformer family with promoted performance, generalization ability across different variates, and better utilization of arbitrary lookback windows, making it a nice alternative as the fundamental backbone of time series forecasting. Code is available at this repository: https://github.com/thuml/iTransformer.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

iTransformer: Inverted Transformers Are Effective for Time Series Forecasting

1.2. Authors

Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, Mingsheng Long. Affiliations include the School of Software, BNRist, Tsinghua University, Beijing, China, and Ant Group, Hangzhou, China. The authors are primarily affiliated with Tsinghua University, a prominent research institution, with some also associated with Ant Group, a major technology company, suggesting a blend of academic rigor and industrial relevance in their research.

1.3. Journal/Conference

The paper was published on arXiv, a preprint server, on 2023-10-10T13:44:09.000Z. arXiv is widely used in the academic community for disseminating research rapidly before peer review or formal publication. While not a peer-reviewed journal or conference in itself, papers on arXiv often precede submissions to prestigious venues.

1.4. Publication Year

2023

1.5. Abstract

The paper addresses the challenges faced by traditional Transformer-based time series forecasters, which often model global dependencies over temporal tokens (multiple variates at the same timestamp). These challenges include performance degradation and computational explosion with larger lookback windows, and difficulties in learning variate-centric representations due to the fusion of distinct measurements within temporal tokens. The authors propose iTransformer, which repurposes the Transformer architecture by inverting the dimensions: attention and the feed-forward network (FFN) are applied on the inverted dimensions. Specifically, the time points of individual series are embedded into variate tokens, allowing the attention mechanism to capture multivariate correlations, while the FFN processes each variate token to learn nonlinear representations. iTransformer achieves state-of-the-art performance on real-world datasets, demonstrating promoted performance, generalization ability across different variates, and better utilization of arbitrary lookback windows, positioning it as a fundamental backbone for time series forecasting.

Abstract page: https://arxiv.org/abs/2310.06625. Publication status: preprint on arXiv. PDF link: https://arxiv.org/pdf/2310.06625v4.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the inefficiency and performance limitations of existing Transformer-based models when applied to multivariate time series forecasting. Traditional Transformer models, drawing inspiration from natural language processing, often treat each timestamp as a "token," embedding multiple variates (different measurements or features) from that single time point into one combined representation. This approach has several critical drawbacks in time series forecasting:

  1. Performance Degradation with Large Lookback Windows: As the historical data (lookback window) increases, Transformer performance can degrade, and computational costs (especially for self-attention) explode quadratically.

  2. Meaningless Attention Maps: Fusing multiple variates (which might represent delayed events or distinct physical measurements) into a single temporal token can lead to meaningless attention maps, as the model struggles to learn variate-centric representations. The attention mechanism might not effectively capture true dependencies when tokens are not semantically cohesive.

  3. Competition from Linear Models: The recent success of simpler linear forecasting models (like DLinear or TiDE) which often outperform complex Transformer architectures, further questions the efficacy of current Transformer adaptations for time series. This suggests that the issue might not be with the Transformer's core components but with how they are applied to time series data.

    The paper's entry point or innovative idea is to re-evaluate the fundamental application of Transformer components to time series by inverting the dimensions. Instead of processing time-aligned (temporal) tokens, iTransformer processes variate-aligned tokens, essentially treating each individual time series (corresponding to one variate) as a token. This inverted perspective aims to address the aforementioned issues by better aligning the Transformer's strengths (capturing dependencies) with the inherent structure of multivariate time series data.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  1. Repurposed Transformer Architecture (iTransformer): It proposes iTransformer, a novel framework that applies native Transformer components (self-attention, feed-forward networks, layer normalization) on inverted dimensions. Instead of temporal tokens, it treats independent time series of each variate as tokens. This is a fundamental architectural shift without modifying the core Transformer components themselves.

  2. Enhanced Multivariate Correlation Capture: By embedding independent time series as variate tokens, the self-attention mechanism is naturally repurposed to capture multivariate correlations (dependencies between different variates) across their entire historical sequences, leading to more interpretable attention maps.

  3. Improved Series Representation Learning: The feed-forward network (FFN) is applied to each variate token independently, allowing it to learn series-global representations for individual time series, which are then used for forecasting future steps. This is argued to be more effective than FFNs on temporal tokens which are often too localized.

  4. State-of-the-Art Performance: iTransformer achieves comprehensive state-of-the-art performance on challenging real-world datasets, outperforming many advanced Transformer-based and linear forecasting models.

  5. Promoted Performance and Generalization: The inverted framework consistently improves the performance of various Transformer variants (e.g., Transformer, Reformer, Informer, Flowformer, Flashformer). It also demonstrates strong generalization ability on unseen variates and effectively utilizes arbitrary lookback windows for more precise predictions, addressing key pain points of previous Transformer-based forecasters.

    These findings collectively solve the problems of performance degradation with larger lookback windows, computational explosion for vanilla Transformers, and the generation of meaningless attention maps by providing a more appropriate and efficient way to apply Transformer components to multivariate time series data.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand iTransformer and its innovations, a foundational grasp of Transformers and Time Series Forecasting is essential.

3.1.1. Time Series Forecasting

Time series forecasting involves predicting future values of a variable based on its past observations.

  • Univariate Time Series: A single sequence of observations recorded over time.
  • Multivariate Time Series: Multiple related time series observed simultaneously. For example, in a weather dataset, temperature, humidity, and wind speed are distinct variates (variables) that form a multivariate time series.
  • Lookback Window (Input Length): The length of historical data used to make a prediction, often denoted as $T$.
  • Prediction Length (Horizon): The number of future time steps to be predicted, often denoted as $S$.

3.1.2. Transformer Architecture

The Transformer (Vaswani et al., 2017) is a neural network architecture primarily based on the self-attention mechanism. It was originally developed for natural language processing (NLP) tasks but has since been adapted for various domains, including computer vision and time series analysis.

The vanilla Transformer typically consists of an encoder and a decoder. iTransformer uses an encoder-only architecture. Key components of a Transformer encoder block include:

  • Tokens and Embeddings: Input sequences (e.g., words in NLP, time points in time series) are first converted into numerical representations called embeddings. These embeddings are then treated as tokens that the Transformer processes.
  • Positional Encoding: Since Transformers process input tokens in parallel without inherent sequence order, positional encodings are added to the embeddings to inject information about the relative or absolute position of tokens in the sequence. This is crucial for tasks where order matters, like time series.
  • Multi-Head Self-Attention (MHSA): This is the core mechanism that allows the Transformer to weigh the importance of different tokens in the input sequence when processing each token.
    • For each token, three vectors are created: Query ($Q$), Key ($K$), and Value ($V$). These are typically derived by linearly transforming the token's embedding.
    • The attention score between a Query and a Key represents how much the Query should "pay attention" to the Key. These scores are calculated for all Query-Key pairs.
    • The scores are then divided by $\sqrt{d_k}$ (where $d_k$ is the dimension of the Key vectors) and passed through a softmax function to obtain attention weights.
    • Finally, these weights are used to compute a weighted sum of the Value vectors, producing the output of the attention mechanism.
    • Multi-head means this process is repeated multiple times in parallel with different linear transformations, and the results are concatenated and linearly transformed again. This allows the model to capture different types of relationships.
    • The standard formula for Self-Attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
      • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing the input transformations.
      • $d_k$ is the dimension of the key vectors.
      • $QK^T$ is the dot product of Query and Key matrices, forming the raw attention scores.
      • $\mathrm{softmax}$ normalizes the scores to create weights.
      • The output is a weighted sum of the Value vectors.
  • Feed-Forward Network (FFN): After the attention sub-layer, each token's output is passed through an identical, position-wise feed-forward network. This typically consists of two linear transformations with a ReLU activation in between. It processes each position (token) independently.
  • Layer Normalization (LayerNorm): Applied after each sub-layer (attention and FFN), LayerNorm normalizes the activations across the feature dimension for each input sample independently. This helps stabilize training and speed up convergence.
    • The LayerNorm operation for a vector $x$ is: $ \mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta $ Where:
      • $\mu$ is the mean of the elements in $x$.
      • $\sigma^2$ is the variance of the elements in $x$.
      • $\gamma$ and $\beta$ are learnable scale and shift parameters.
      • $\epsilon$ is a small constant for numerical stability.
  • Residual Connections: Residual connections (or skip connections) are used around each sub-layer, followed by LayerNorm. This helps with training deeper networks by allowing gradients to flow more easily. The output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$ (see the sketch after this list).
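
The following is a minimal, self-contained PyTorch sketch of one such encoder block, wiring together multi-head self-attention, the position-wise FFN, residual connections, and post-sub-layer LayerNorm as described above. It is an illustrative sketch, not the paper's code; hyperparameters such as `d_model=512` and `d_ff=2048` are arbitrary defaults.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal Transformer encoder block using LayerNorm(x + Sublayer(x))."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        # Position-wise feed-forward network: two linear layers with a ReLU in between.
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, d_model); attention mixes information across tokens.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.drop(attn_out))      # residual connection + LayerNorm
        x = self.norm2(x + self.drop(self.ffn(x)))   # FFN applied per token + residual + LayerNorm
        return x

tokens = torch.randn(4, 96, 512)        # e.g. a batch of 4 sequences of 96 tokens
print(EncoderBlock()(tokens).shape)     # torch.Size([4, 96, 512])
```

Note that nothing in the block ties the token axis to time: the same machinery works whether the tokens are temporal tokens or, as in iTransformer, variate tokens.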

3.1.3. Multi-Layer Perceptron (MLP)

An MLP is a class of feedforward artificial neural networks. It consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Each node (neuron) uses a nonlinear activation function, except for the input nodes. MLPs are capable of learning complex non-linear relationships. In iTransformer, MLPs are used for embedding and projection layers.

3.2. Previous Works

The paper categorizes previous Transformer-based forecasters into four categories based on modifications to components and architecture (Figure 3).

The following figure (Figure 3 from the original paper) shows Transformer-based forecasters categorized by component and architecture modifications:

Figure 3: Transformer-based forecasters categorized by component and architecture modifications. The figure arranges existing models along two axes (component modifications vs. architecture modifications) and highlights the position of iTransformer.

  1. Component Adaptation (e.g., attention module modification): This category, including works like Autoformer (Wu et al., 2021), Informer (Zhou et al., 2021), and FEDformer (Zhou et al., 2022), primarily focuses on modifying the attention module to improve efficiency (especially for long sequences) or adapt to temporal dependencies. For instance, Informer uses ProbSparse self-attention to reduce quadratic complexity. Autoformer introduces an auto-correlation mechanism for long-term series forecasting.

  2. Architectural Adaptation (e.g., inherent processing of time series): This category focuses on how time series data is prepared or structured for the Transformer. Examples include Stationarization (Liu et al., 2022b) which tackles non-stationarity, Channel Independence (Nie et al., 2023), and Patching (Nie et al., 2023). Patching involves splitting the input time series into smaller sub-series (patches) and treating these patches as tokens, aiming to enlarge the receptive field and capture local patterns. PatchTST (Nie et al., 2023) is a prominent example.

  3. Both Component and Architectural Modification: Crossformer (Zhang & Yan, 2023) falls into this category, explicitly capturing cross-time and cross-variate dependencies through renovated attention mechanisms and architecture.

  4. No Component Modification, only Architectural: iTransformer is presented as the first work in this category, keeping native Transformer components but inverting the dimensions they operate on.

    The paper also highlights the boom of linear forecasting models like N-BEATS (Oreshkin et al., 2019), DLinear (Zeng et al., 2023), TiDE (Das et al., 2023), and RLinear (Li et al., 2023). These models demonstrate that simple linear layers can achieve competitive or even superior performance and efficiency compared to complex Transformer architectures, challenging the "passion for architectural modifications of Transformer-based forecasters."

3.3. Technological Evolution

The field of time series forecasting has seen significant evolution:

  • Statistical Methods: Traditionally, methods like ARIMA (Box & Jenkins, 1968) and Exponential Smoothing dominated. These rely on statistical properties and often assume linearity or specific data distributions.
  • Recurrent Neural Networks (RNNs) and Temporal Convolutional Networks (TCNs): With the rise of deep learning, RNNs (like LSTMs by Zhao et al., 2017, and DeepAR by Salinas et al., 2020) and TCNs (Bai et al., 2018; SCINet by Liu et al., 2022a) became popular for their ability to model sequential data.
  • Transformers: Inspired by their success in NLP, Transformers were adapted for time series (Wu et al., 2021; Nie et al., 2023). Early adaptations often treated time points as tokens, similar to words in a sentence. However, this direct application revealed limitations, especially in multivariate settings where variates at the same timestamp might not be semantically related or may exhibit phase shifts.
  • Linear Models Resurgence: The recent strong performance of linear models has caused researchers to question the necessity and effectiveness of complex deep learning architectures for time series, especially the way Transformers were being applied.
  • Inverted Transformers (iTransformer): This paper fits into the latest wave of evolution by suggesting that the Transformer's core components are powerful, but their application to time series needs a fundamental re-thinking of tokenization and dimension. By inverting the tokenization from temporal to variate, iTransformer attempts to better align the Transformer's strengths with the intrinsic characteristics of multivariate time series.

3.4. Differentiation Analysis

iTransformer differentiates itself from other Transformer-based time series forecasters primarily by its novel tokenization strategy and the inversion of dimensions for Transformer components, while crucially not modifying the basic components themselves.

The following figure (Figure 2 from the original paper) shows a comparison between the vanilla Transformer (top) and the proposed iTransformer (bottom):

Figure 2: Comparison between the vanilla Transformer (top) and the proposed iTransformer (bottom). The vanilla Transformer embeds each time step as a temporal token containing the multivariate representation of that step and attends to temporal dependencies, whereas iTransformer inverts the dimensions, embedding each variate as a token and attending to multivariate correlations.

Here's a breakdown of the core differences and innovations compared to main methods:

  • Traditional Transformer-based Forecasters (e.g., Autoformer, Informer, FEDformer, vanilla Transformer adaptations):

    • Tokenization: These models typically embed multiple variates of the same timestamp into a single temporal token. Attention is then applied along the temporal dimension to capture time-wise dependencies.
    • Problems: As discussed in the paper, this can lead to malpositioned and too localized representations, meaningless attention maps (especially when variates at a given timestamp are unrelated or have phase lags), performance degradation with large lookback windows, and computational explosion.
    • Differentiation: iTransformer explicitly rejects this temporal tokenization. It views each entire time series of an individual variate as a token (variate token). Attention is then applied across the variate dimension to capture relationships between different variates. The feed-forward network (FFN) operates on the temporal dimension for each variate token to learn its series representation.
  • PatchTST (Nie et al., 2023):

    • Tokenization: PatchTST uses patching, where segments (patches) of a time series are treated as tokens. This is still primarily a temporal tokenization strategy, albeit with a broader receptive field than single time points.
    • Differentiation: While PatchTST improves upon vanilla Transformers by using patches, iTransformer's variate tokenization is a more fundamental shift. iTransformer argues that PatchTST's patching mechanism might "lose focus on specific locality to handle rapid fluctuation" and can still introduce "interaction noises between time-unaligned patches from different multivariate." iTransformer's approach of treating a whole series as a token and using attention for multivariate correlation is distinct.
  • Crossformer (Zhang & Yan, 2023):

    • Tokenization/Mechanism: Crossformer explicitly aims to capture both cross-time and cross-variate dependencies, often through modified attention mechanisms and architecture.
    • Differentiation: iTransformer argues that Crossformer's explicit dual-dependency capture can still suffer from "interaction of time-unaligned patches from different multivariate" leading to "unnecessary noise." iTransformer provides a cleaner separation of duties: self-attention for multivariate correlations and FFN for temporal representations within each variate.
  • Linear-based models (e.g., DLinear, TiDE, RLinear):

    • Architecture: These are typically much simpler models, often relying on linear layers or MLPs for forecasting. They highlight the effectiveness of simple linear mappings for temporal dependencies.

    • Differentiation: iTransformer acknowledges the power of linear models, particularly for temporal dependencies. It integrates this insight by having the feed-forward network (an MLP-like component) process the temporal dimension of each variate token. This effectively leverages the temporal modeling capabilities of linear layers while retaining the multivariate correlation power of attention on the variate dimension.

      In summary, iTransformer's core innovation lies in its inverted perspective, which fundamentally re-thinks the tokenization and dimension assignments for Transformer components in time series forecasting, leading to a more natural and effective alignment with the data's inherent structure.

4. Methodology

4.1. Principles

The core principle behind iTransformer is the inversion of how Transformer components (specifically self-attention and the feed-forward network) are applied to multivariate time series data, without modifying the intrinsic design of these components. The authors argue that previous Transformer-based forecasters incorrectly apply attention along the temporal dimension on temporal tokens (which combine all variates at a single timestamp). This approach leads to issues like meaningless attention maps due to heterogeneous variates and degraded performance with long sequences.

iTransformer's intuition is that:

  1. Variate-centric Representations: Each individual time series (corresponding to a single variate) should be treated as a coherent unit or token. This allows for learning variate-centric representations that respect the distinct physical meanings and statistical distributions of each series.

  2. Multivariate Correlations: The self-attention mechanism, which is excellent at capturing pairwise dependencies, should be applied across these variate tokens to model multivariate correlations (i.e., how different variates interact with each other over time). This provides a more interpretable and meaningful attention map.

  3. Temporal Representations: The feed-forward network (FFN), which is proficient at learning nonlinear representations from individual inputs, should be applied to the temporal dimension of each variate token to extract complex patterns and representations for that specific time series. This leverages the strengths of MLP-based models for temporal modeling.

  4. Layer Normalization for Discrepancies: Layer normalization should be applied to the series representation of individual variates to reduce discrepancies caused by inconsistent measurements and tackle non-stationary problems.

    By adopting this inverted view, iTransformer aims to build a more robust, generalizable, and performant Transformer backbone for multivariate time series forecasting.

4.2. Core Methodology In-depth (Layer by Layer)

iTransformer adopts an encoder-only architecture, similar to the encoder of a vanilla Transformer, but with a crucial inversion of the dimensions that its core modules operate on.

The following figure (Figure 4 from the original paper) shows the overall structure of iTransformer:

Figure 4: Overall structure of iTransformer, which shares the same modular arrangement with the encoder of Transformer. (a) Raw series of different variates are independently embedded as variate tokens. (b) Self-attention over variate tokens captures multivariate correlations. (c) A shared feed-forward network extracts series representations for each token. (d) Layer normalization is applied along the series dimension to reduce discrepancies among variates.

Given historical observations of a multivariate time series $\mathbf{X} \in \mathbb{R}^{T \times N}$, where $T$ is the number of time steps (lookback length) and $N$ is the number of variates, the goal is to predict the future $S$ time steps $\mathbf{Y} \in \mathbb{R}^{S \times N}$.

The overall process for iTransformer to predict the future series $\hat{\mathbf{Y}}_{:,n}$ of a specific variate $n$ from its lookback series $\mathbf{X}_{:,n}$ is formulated as follows: $ \mathbf{h}_{n}^{0} = \operatorname{Embedding}(\mathbf{X}_{:,n}), \qquad \mathbf{H}^{l+1} = \operatorname{TrmBlock}(\mathbf{H}^{l}), \quad l = 0, \dots, L-1, \qquad \hat{\mathbf{Y}}_{:,n} = \operatorname{Projection}(\mathbf{h}_{n}^{L}), $ Where:

  • $\mathbf{X}_{:,n} \in \mathbb{R}^{T}$ represents the entire time series for the $n$-th variate over the lookback window $T$.

  • $\operatorname{Embedding}(\cdot)$ is a function that transforms the raw time series of a variate into a higher-dimensional variate token representation.

  • $\mathbf{h}_{n}^{0} \in \mathbb{R}^{D}$ is the initial embedded variate token for the $n$-th variate, where $D$ is the token (model) dimension.

  • $\mathbf{H} = \{\mathbf{h}_1, \dots, \mathbf{h}_N\} \in \mathbb{R}^{N \times D}$ is the collection of all $N$ embedded variate tokens.

  • $\operatorname{TrmBlock}(\cdot)$ represents a single iTransformer block, which processes the variate tokens.

  • $l$ denotes the layer index, ranging from $0$ to $L-1$, where $L$ is the total number of iTransformer blocks.

  • $\mathbf{h}_{n}^{L}$ is the final representation of the $n$-th variate token after passing through $L$ blocks.

  • $\operatorname{Projection}(\cdot)$ is a function that transforms the final variate token representation back into the predicted future series for that variate.

  • $\hat{\mathbf{Y}}_{:,n} \in \mathbb{R}^{S}$ is the predicted future series for the $n$-th variate over the prediction length $S$.

    Let's break down each component:

4.2.1. Embedding Layer

  • Input: The raw time series for each individual variate. For a specific variate $n$, this is $\mathbf{X}_{:,n} \in \mathbb{R}^{T}$.
  • Process: This layer independently embeds each variate's time series into a high-dimensional variate token.
  • Implementation: It is implemented by a Multi-Layer Perceptron (MLP), which can learn complex nonlinear representations from the entire sequence of $T$ time points of a single variate.
  • Output: $\mathbf{h}_{n}^{0} \in \mathbb{R}^{D}$, the initial variate token of dimension $D$ for the $n$-th variate.
  • Role: This step transforms each individual time series into a semantically rich variate token that aggregates global representations of the series, ensuring variate-centricity from the start (see the sketch below).

4.2.2. iTransformer Block (TrmBlock)

Each iTransformer block consists of layer normalization, self-attention, and a feed-forward network, arranged in a typical Transformer encoder block structure with layer normalization applied after each residual connection. These components, however, operate on the inverted dimensions.

4.2.2.1. Layer Normalization

  • Role in iTransformer: Unlike traditional Transformer-based forecasters where LayerNorm might normalize across variates at a single timestamp, in iTransformer, LayerNorm is applied to the series representation of individual variates. This is crucial for handling non-stationarity and reducing discrepancies due to inconsistent measurements across different variates. By normalizing each variate token to a Gaussian distribution, it helps ensure that the self-attention mechanism is not unduly influenced by scale differences between variates.
  • Formula: For a collection of variate tokens $\mathbf{H} = \{\mathbf{h}_1, \dots, \mathbf{h}_N\}$, the LayerNorm operation is applied to each $\mathbf{h}_n$ independently: $ \mathrm{LayerNorm}(\mathbf{H}) = \left\{ \frac{\mathbf{h}_{n} - \mathrm{Mean}(\mathbf{h}_{n})}{\sqrt{\mathrm{Var}(\mathbf{h}_{n}) + \epsilon}} \ \Big| \ n = 1, \ldots, N \right\} $ Where:
    • $\mathbf{h}_n \in \mathbb{R}^D$ is the variate token (series representation) for the $n$-th variate.
    • $\mathrm{Mean}(\mathbf{h}_n)$ is the mean of the elements in the vector $\mathbf{h}_n$.
    • $\mathrm{Var}(\mathbf{h}_n)$ is the variance of the elements in the vector $\mathbf{h}_n$.
    • $\epsilon$ is a small constant added for numerical stability.
    • The output is a set of normalized variate tokens.

4.2.2.2. Self-Attention

  • Role in iTransformer: In iTransformer, self-attention is applied across the variate dimension to capture multivariate correlations. Each variate token $\mathbf{h}_n$ acts as an input to the attention mechanism.
  • Input: The set of normalized variate tokens $\mathbf{H} \in \mathbb{R}^{N \times D}$.
  • Process: Queries ($Q$), Keys ($K$), and Values ($V$) are generated from $\mathbf{H}$. The attention mechanism then calculates pairwise similarity between variate tokens to determine their interdependencies.
  • Formula (standard self-attention applied on variate tokens; a minimal sketch follows this list): $ \mathrm{Attention}(\mathbf{H}) = \mathrm{softmax}\left(\frac{(\mathbf{H}W_Q)(\mathbf{H}W_K)^T}{\sqrt{d_k}}\right)(\mathbf{H}W_V) $ Where:
    • $\mathbf{H} \in \mathbb{R}^{N \times D}$ is the matrix of variate tokens from the previous layer.
    • $W_Q, W_K, W_V \in \mathbb{R}^{D \times d_k}$ are learnable weight matrices used to project $\mathbf{H}$ into the Query, Key, and Value matrices respectively, each of dimension $d_k$.
    • $N$ is the number of variate tokens.
    • $d_k$ is the dimension of the projected Query and Key vectors.
    • The pre-softmax attention map $\mathbf{A} \in \mathbb{R}^{N \times N}$ has entries $\mathbf{A}_{i,j} \propto \mathbf{q}_i^T \mathbf{k}_j$, where $\mathbf{q}_i$ and $\mathbf{k}_j$ are the Query and Key vectors for variates $i$ and $j$. This directly reflects the correlation between variate $i$ and variate $j$.
  • Output: The updated variate tokens after self-attention, typically combined with residual connection and LayerNorm.
  • Enhanced Interpretability: This design means that the attention weights directly reflect the correlations between different variates, making the attention map highly interpretable for understanding multivariate correlations.
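
To make the inverted attention concrete, the following single-head sketch (hypothetical names, weights created inline purely for illustration) shows that the softmax score matrix over variate tokens has shape $N \times N$ and can be read directly as learned variate-to-variate dependencies:

```python
import torch
import torch.nn as nn

def variate_attention(H: torch.Tensor, d_k: int = 64):
    """Single-head self-attention over variate tokens H of shape (batch, N, D).
    Returns the updated tokens and the (batch, N, N) attention map."""
    B, N, D = H.shape
    # Fresh projection weights created here only for illustration.
    W_q, W_k, W_v = (nn.Linear(D, d_k, bias=False) for _ in range(3))
    Q, K, V = W_q(H), W_k(H), W_v(H)               # each (B, N, d_k)
    scores = Q @ K.transpose(1, 2) / d_k ** 0.5    # (B, N, N) raw variate-to-variate scores
    A = torch.softmax(scores, dim=-1)              # attention map over the variate axis
    return A @ V, A

tokens = torch.randn(8, 321, 512)                  # e.g. N=321 variates (ECL)
out, attn_map = variate_attention(tokens)
print(out.shape, attn_map.shape)                   # (8, 321, 64) and (8, 321, 321)
```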

4.2.2.3. Feed-Forward Network (FFN)

  • Role in iTransformer: The FFN is applied independently to each variate token $\mathbf{h}_n$, effectively operating on the temporal dimension (implicitly, since $\mathbf{h}_n$ encodes the entire time series of variate $n$).
  • Input: The output of the self-attention sub-layer (after LayerNorm and residual connection) for each variate token.
  • Process: Each variate token $\mathbf{h}_n$ passes through an identical FFN, which typically consists of two linear transformations with an activation function (e.g., ReLU) in between.
  • Implementation: The FFN is shared across all variate tokens. This means the same FFN learns intrinsic patterns (like amplitude, periodicity, frequency spectrums) that are generalizable across different time series.
  • Output: Enhanced variate tokens that have learned richer nonlinear representations from their respective time series.
  • No Positional Encoding: Because each variate token $\mathbf{h}_n$ already encodes the entire time series of its corresponding variate (implicitly capturing temporal order within its representation), and the FFN works identically on independent variates, explicit positional encoding (as used in vanilla Transformers for temporal tokens) is not needed in iTransformer.

4.2.3. Projection Layer

  • Input: The final variate tokens $\mathbf{h}_{n}^{L} \in \mathbb{R}^{D}$ after $L$ iTransformer blocks.
  • Process: This layer transforms each final variate token into its corresponding predicted future time series.
  • Implementation: It is implemented by an MLP.
  • Output: $\hat{\mathbf{Y}}_{:,n} \in \mathbb{R}^{S}$, the predicted future $S$ time steps for the $n$-th variate.
  • Role: This step converts the learned variate-centric representations into concrete future forecasts. The fact that a simple projection (a linear layer or shallow MLP) suffices here is consistent with the findings of successful linear forecasters.

4.2.4. Algorithm Overview

The pseudo-code for iTransformer summarizes these steps:

The following is Algorithm 1 from the original paper:

Algorithm 1 iTransformer - Overall Architecture.
Require: Input lookback time series $\mathbf{X} \in \mathbb{R}^{T \times N}$; input length $T$; predicted length $S$; number of variates $N$; token dimension $D$; number of iTransformer blocks $L$.
1: $\mathbf{X} = \mathbf{X}.\mathrm{transpose}()$  ▷ $\mathbf{X} \in \mathbb{R}^{N \times T}$
2: ▷ A multi-layer perceptron works on the last dimension to embed each series into a variate token.
3: $\mathbf{H}^{0} = \mathrm{MLP}(\mathbf{X})$  ▷ $\mathbf{H}^{0} \in \mathbb{R}^{N \times D}$
4: for $l$ in $\{1, \dots, L\}$:  ▷ Run through the iTransformer blocks.
5:   ▷ The self-attention layer is applied on variate tokens.
6:   $\mathbf{H}^{l-1} = \mathrm{LayerNorm}(\mathbf{H}^{l-1} + \mathrm{SelfAttn}(\mathbf{H}^{l-1}))$
7:   ▷ The feed-forward network is utilized for series representations, broadcast to each token ($\mathbf{H} \in \mathbb{R}^{N \times D}$).
8:   $\mathbf{H}^{l} = \mathrm{LayerNorm}(\mathbf{H}^{l-1} + \mathrm{FeedForward}(\mathbf{H}^{l-1}))$
9:   ▷ LayerNorm is adopted on series representations to reduce discrepancies among variates.
10: end for
11: $\hat{\mathbf{Y}} = \mathrm{MLP}(\mathbf{H}^{L})$  ▷ Project tokens back to the predicted series, $\hat{\mathbf{Y}} \in \mathbb{R}^{N \times S}$
12: $\hat{\mathbf{Y}} = \hat{\mathbf{Y}}.\mathrm{transpose}()$  ▷ $\hat{\mathbf{Y}} \in \mathbb{R}^{S \times N}$
13: Return $\hat{\mathbf{Y}}$  ▷ Return the prediction result

Step-by-step Explanation of Algorithm 1:

  1. Input Data Transposition (Line 1):

    • The input lookback time series $\mathbf{X}$ is initially of shape $\mathbb{R}^{T \times N}$ (time steps $\times$ variates).
    • It is immediately transposed to $\mathbf{X} \in \mathbb{R}^{N \times T}$. This reorients the data such that each row now represents an individual variate's entire historical time series, which is crucial for variate-centric tokenization.
  2. Embedding Layer (Lines 2-3):

    • A multi-layer perceptron (MLP) is applied to the transposed input $\mathbf{X}$. This MLP operates on the last dimension (which is $T$, the length of each variate's time series).
    • For each variate's time series $\mathbf{X}_{n,:}$ (row $n$ of the transposed $\mathbf{X}$), the MLP generates an initial variate token $\mathbf{h}_n^0$.
    • The output $\mathbf{H}^0$ is a matrix of shape $\mathbb{R}^{N \times D}$, where each row $\mathbf{h}_n^0$ is a variate token of dimension $D$.
  3. iTransformer Blocks (Lines 4-9):

    • The algorithm then iterates through $L$ iTransformer blocks (from $l = 1$ to $L$). In Transformer terminology, $\mathbf{H}^{l-1}$ is the input to the current block and $\mathbf{H}^{l}$ its output.
    • Self-Attention Sub-layer (Lines 5-6):
      • The self-attention layer is applied to the current set of variate tokens $\mathbf{H}^{l-1}$. As explained, this attention mechanism captures multivariate correlations by computing relationships between different variate tokens.
      • A residual connection is added ($+\ \mathrm{SelfAttn}(\mathbf{H}^{l-1})$), followed by layer normalization ($\mathrm{LayerNorm}(\mathbf{H}^{l-1} + \dots)$). The output is written back into $\mathbf{H}^{l-1}$ for the next sub-layer. (This in-place notation for the output of the first sub-layer is common in Transformer implementations, indicating that $\mathbf{H}^{l-1}$ is updated before the FFN stage.)
    • Feed-Forward Network Sub-layer (Lines 7-8):
      • The feed-forward network (FFN) is then applied. This FFN processes each variate token independently to learn series representations. It is shared across all $N$ tokens.
      • Again, a residual connection is added ($+\ \mathrm{FeedForward}(\mathbf{H}^{l-1})$), followed by layer normalization ($\mathrm{LayerNorm}(\mathbf{H}^{l-1} + \dots)$).
      • The final output of the $l$-th block is $\mathbf{H}^{l}$, which becomes the input to the next block.
  4. Projection Layer (Lines 11-12):

    • After passing through all $L$ iTransformer blocks, the final variate tokens $\mathbf{H}^{L}$ are fed into another MLP.
    • This MLP projects each variate token (of dimension $D$) back into its predicted future series (of length $S$). The output of this MLP is $\hat{\mathbf{Y}} \in \mathbb{R}^{N \times S}$.
    • The result is then transposed from $\mathbb{R}^{N \times S}$ back to $\mathbb{R}^{S \times N}$ to match the conventional output format for time series forecasting, where rows are time steps and columns are variates.
  5. Return (Line 13):

    • The model returns the predicted future time series $\hat{\mathbf{Y}}$.

      This detailed breakdown shows how iTransformer systematically re-configures the Transformer's internal processing flow to align with the characteristics of multivariate time series data, leveraging the strengths of its components in an inverted manner.
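
The following compact PyTorch sketch mirrors Algorithm 1 under simplifying assumptions: a single linear layer for the embedding and projection MLPs, `nn.MultiheadAttention` for the inverted attention, GELU in the FFN (chosen arbitrarily), and LayerNorm after each residual connection as in the pseudo-code. It is an illustrative reimplementation, not the official repository code.

```python
import torch
import torch.nn as nn

class ITransformerSketch(nn.Module):
    """Compact sketch of Algorithm 1: variate tokens -> L inverted blocks -> projection."""

    def __init__(self, lookback_len: int, pred_len: int, d_model: int = 512,
                 n_heads: int = 8, d_ff: int = 2048, num_blocks: int = 3):
        super().__init__()
        self.embed = nn.Linear(lookback_len, d_model)   # per-variate series -> variate token
        self.blocks = nn.ModuleList()
        for _ in range(num_blocks):
            self.blocks.append(nn.ModuleDict({
                "attn": nn.MultiheadAttention(d_model, n_heads, batch_first=True),
                "ffn": nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                     nn.Linear(d_ff, d_model)),
                "norm1": nn.LayerNorm(d_model),
                "norm2": nn.LayerNorm(d_model),
            }))
        self.project = nn.Linear(d_model, pred_len)     # variate token -> future series

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, N) -> variate tokens H: (batch, N, D)   (Algorithm 1, lines 1-3)
        H = self.embed(x.transpose(1, 2))
        for blk in self.blocks:                          # lines 4-10
            a, _ = blk["attn"](H, H, H)                  # attention over variate tokens
            H = blk["norm1"](H + a)
            H = blk["norm2"](H + blk["ffn"](H))          # shared FFN applied per token
        y = self.project(H)                              # (batch, N, S)   (line 11)
        return y.transpose(1, 2)                         # (batch, S, N)   (lines 12-13)

model = ITransformerSketch(lookback_len=96, pred_len=96)
print(model(torch.randn(4, 96, 7)).shape)                # torch.Size([4, 96, 7])
```

Note that none of the parameter shapes depend on the number of variates $N$, which is what later enables generalization to unseen variates and flexible input widths.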

5. Experimental Setup

5.1. Datasets

The experiments in the paper used a total of 7 real-world public datasets and an additional set of 6 Market datasets.

The following are the results from Table 4 of the original paper:

Dataset Dim Prediction Length Dataset Size Frequency Information
ETTh1, ETTh2 7 {96, 192, 336, 720} (8545, 2881, 2881) Hourly Electricity
ETTm1, ETTm2 7 {96, 192, 336, 720} (34465, 11521, 11521) 15min Electricity
Exchange 8 {96, 192, 336, 720} (5120, 665, 1422) Daily Economy
Weather 21 {96, 192, 336, 720} (36792, 5271, 10540) 10min Weather
ECL 321 {96, 192, 336, 720} (18317, 2633, 5261) Hourly Electricity
Traffic 862 {96, 192, 336, 720} (12185, 1757, 3509) Hourly Transportation
Solar-Energy 137 {96, 192, 336, 720} (36601, 5161, 10417) 10min Energy
PEMS03 358 {12, 24, 48, 96} (15617, 5135, 5135) 5min Transportation
PEMS04 307 {12, 24, 48, 96} (10172, 3375, 3375) 5min Transportation
PEMS07 883 {12, 24, 48, 96} (16911, 5622, 5622) 5min Transportation
PEMS08 170 {12, 24, 48, 96} (10690, 3548, 3548) 5min Transportation
Market-Merchant 285 {12, 24, 72, 144} (7045, 1429, 1429) 10min Transaction
Market-Wealth 485 {12, 24, 72, 144} (7045, 1429, 1429) 10min Transaction
Market-Finance 405 {12, 24, 72, 144} (7045, 1429, 1429) 10min Transaction
Market-Terminal 307 {12, 24, 72, 144} (7045, 1429, 1429) 10min Transaction
Market-Payment 759 {12, 24, 72, 144} (7045, 1429, 1429) 10min Transaction
Market-Customer 395 {12, 24, 72, 144} (7045, 1429, 1429) 10min Transaction

Detailed descriptions:

  • ETT (ETTh1, ETTh2, ETTm1, ETTm2): Contains 7 factors related to electricity transformers. ETTh1 and ETTh2 are hourly, ETTm1 and ETTm2 are 15-minute sampled. These are commonly used for assessing long-term forecasting capabilities.

  • Exchange: Daily exchange rates from 8 countries. Represents economic time series.

  • Weather: 21 meteorological factors, sampled every 10 minutes. A typical multi-variate natural phenomenon dataset.

  • ECL (Electricity Consuming Load): Hourly electricity consumption data from 321 clients. High-dimensional electricity demand forecasting.

  • Traffic: Hourly road occupancy rates from 862 sensors in the San Francisco Bay area. A very high-dimensional transportation dataset with complex spatial-temporal dependencies.

  • Solar-Energy: Solar power production of 137 PV plants, sampled every 10 minutes. Energy forecasting.

  • PEMS (PEMS03, PEMS04, PEMS07, PEMS08): Public traffic network data in California, collected in 5-minute windows. Similar to Traffic, but often used for shorter prediction horizons.

  • Market (6 subsets): Minute-sampled server load of Alipay online transaction applications, with hundreds of variates (285 to 759). These are custom, high-dimensional, real-world transaction datasets, designed to test model performance in complex industrial settings.

    The datasets were chosen to cover a wide range of domains (electricity, economy, weather, transportation, energy, transaction), varying dimensions (from 7 to 883 variates), and different sampling frequencies, making them suitable for evaluating the robustness and generalizability of forecasting models. The inclusion of high-dimensional datasets like Traffic, ECL, PEMS, and Market is particularly important for iTransformer's claims about handling multivariate correlations.

Data processing and train-validation-test splits follow standard chronological division protocols to avoid data leakage. The lookback length $T$ is fixed at 96 for most datasets (except Market), and prediction lengths $S$ vary over {12, 24, 48, 96} for PEMS and {96, 192, 336, 720} for the others. For the Market data, $T = 144$ and $S$ varies over {12, 24, 72, 144}.

5.2. Evaluation Metrics

The paper uses two common metrics for time series forecasting: Mean Squared Error (MSE) and Mean Absolute Error (MAE). Lower values for both metrics indicate better forecasting accuracy.

  1. Mean Squared Error (MSE)

    • Conceptual Definition: MSE measures the average of the squares of the errors. It quantifies the average magnitude of the errors, where larger errors contribute disproportionately more to the total error due to the squaring. It is particularly sensitive to large errors or outliers.
    • Mathematical Formula: $ \mathrm{MSE} = \frac{1}{N \cdot S} \sum_{i=1}^{N} \sum_{t=1}^{S} (y_{i,t} - \hat{y}_{i,t})^2 $
    • Symbol Explanation:
      • $N$: The total number of variates (time series).
      • $S$: The prediction length (number of time steps to forecast).
      • $y_{i,t}$: The true (actual) value for variate $i$ at time step $t$.
      • $\hat{y}_{i,t}$: The predicted value for variate $i$ at time step $t$.
      • $\sum_{i=1}^{N} \sum_{t=1}^{S}$: Summation over all variates and all predicted time steps.
  2. Mean Absolute Error (MAE)

    • Conceptual Definition: MAE measures the average of the absolute differences between predictions and actual observations. It gives an idea of the average magnitude of the error without considering its direction. Unlike MSE, it is less sensitive to outliers, as it does not square the errors.
    • Mathematical Formula: $ \mathrm{MAE} = \frac{1}{N \cdot S} \sum_{i=1}^{N} \sum_{t=1}^{S} |y_{i,t} - \hat{y}_{i,t}| $
    • Symbol Explanation:
      • $N$: The total number of variates (time series).

      • $S$: The prediction length (number of time steps to forecast).

      • $y_{i,t}$: The true (actual) value for variate $i$ at time step $t$.

      • $\hat{y}_{i,t}$: The predicted value for variate $i$ at time step $t$.

      • $|\cdot|$: Absolute value.

      • $\sum_{i=1}^{N} \sum_{t=1}^{S}$: Summation over all variates and all predicted time steps.

        These metrics are effective for validating a method's performance by providing quantitative measures of prediction accuracy. MSE penalizes larger errors more, which can be important in applications where large deviations are critical, while MAE provides a more robust average error magnitude.
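
For reference, here is a minimal NumPy sketch of how these two metrics are typically computed over a forecast of shape (S, N); it is illustrative only, not the paper's evaluation script:

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error averaged over all predicted steps and variates."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error averaged over all predicted steps and variates."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Arrays shaped (S, N): S predicted time steps for N variates.
y_true = np.random.randn(96, 7)
y_pred = y_true + 0.1 * np.random.randn(96, 7)
print(mse(y_true, y_pred), mae(y_true, y_pred))
```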

5.3. Baselines

The paper compares iTransformer against 10 well-acknowledged forecasting models, representing different categories of deep forecasters for time series:

  1. Transformer-based methods:

    • Autoformer (Wu et al., 2021): Utilizes decomposition and auto-correlation mechanisms for long-term forecasting.
    • FEDformer (Zhou et al., 2022): Incorporates frequency-enhanced decomposed Transformer for improved long-term forecasting.
    • Stationary (Liu et al., 2022b): Addresses non-stationarity in time series data using adaptive normalization techniques within a Transformer framework.
    • Crossformer (Zhang & Yan, 2023): Designed to explicitly capture cross-time and cross-variate dependencies.
    • PatchTST (Nie et al., 2023): Applies patching (dividing series into sub-series) to Transformers for long-term forecasting, treating patches as tokens.
  2. Linear-based methods: These represent simpler yet effective models that have recently challenged complex deep learning architectures.

    • DLinear (Zeng et al., 2023): A purely linear model that decomposes time series into trend and remainder components and forecasts them separately.
    • TiDE (Das et al., 2023): Time-series Dense Encoder, a simple yet powerful model primarily based on MLPs.
    • RLinear (Li et al., 2023): Revisits linear mapping for long-term time series forecasting, emphasizing simplicity and effectiveness.
  3. TCN-based methods: These models use convolutional neural networks specifically designed for sequential data.

    • SCINet (Liu et al., 2022a): Sample Convolution and Interaction Network, a TCN-based model using interactive downsampling and convolution for time series modeling.

    • TimesNet (Wu et al., 2023): Models temporal 2D-variations by transforming 1D time series into 2D tensors for generalized time series analysis.

      These baselines are representative because they cover the spectrum of modern deep learning approaches to time series forecasting, including various Transformer adaptations, highly competitive linear models, and strong convolutional approaches. This diverse set allows for a comprehensive evaluation of iTransformer's performance across different architectural paradigms and inherent assumptions about time series structure.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that iTransformer consistently achieves state-of-the-art performance across various challenging real-world datasets, particularly excelling in forecasting high-dimensional time series. This validates the effectiveness of its inverted architecture.

The following are the results from Table 1 of the original paper:

Models iTransformer (Ours) RLinear (2023) PatchTST (2023) Crossformer (2023) TiDE (2023) TimesNet (2023) DLinear (2023) SCINet (2022a) FEDformer (2022) Stationary (2022b) Autoformer (2021)
(Each model reports MSE and MAE, averaged over prediction lengths; lower is better.)
ECL | 0.178 0.270 | 0.205 0.290 | 0.244 0.334 | 0.251 0.340 | 0.219 0.298 | 0.240 0.330 | 0.250 0.339 | 0.257 0.331 | 0.290 0.380 | 0.254 0.361
ETT | 0.383 0.407 | 0.414 0.407 | 0.387 0.400 | 0.513 0.496 | 0.403 0.407 | 0.398 0.404 | 0.400 0.403 | 0.481 0.452 | 0.481 0.456 | 0.507 0.500 | 0.553 0.496
Exchange | 0.360 0.403 | 0.378 0.417 | 0.367 0.404 | 0.940 0.707 | 0.370 0.417 | 0.354 0.414 | 0.358 0.404 | 0.780 0.619 | 0.461 0.454 | 0.610 0.570 | 0.804 0.637
PEMS | 0.113 0.221 | 0.180 0.291 | 0.169 0.281 | 0.326 0.419 | 0.147 0.248 | 0.278 0.375 | 0.114 0.224 | 0.213 0.327 | 0.147 0.249 | 0.667 0.601 |
Solar-Energy | 0.233 0.262 | 0.270 0.307 | 0.261 0.301 | 0.641 0.639 | 0.347 0.417 | 0.290 0.330 | 0.301 0.340 | 0.639 0.639 | 0.381 0.381 | 0.700 0.601 | 0.801 0.610
Traffic | 0.428 0.282 | 0.626 0.378 | 0.481 0.304 | 0.550 0.304 | 0.476 0.300 | 0.650 0.396 | 0.580 0.366 | 0.612 0.338 | 0.756 0.474 | 0.617 0.336 | 0.789 0.505
Weather | 0.258 0.278 | 0.272 0.291 | 0.259 0.281 | 0.259 0.315 | 0.260 0.280 | 0.260 0.292 | 0.259 0.280 | 0.363 0.330 | 0.360 0.336 | 0.292 0.280 | 0.360 0.330

(Note: The original Table 1 had some formatting issues, especially for the multi-column header and alignment. I have transcribed it as best as possible, focusing on data fidelity.)

Key observations from Table 1:

  • Superior Performance: iTransformer achieves the best or second-best MSE/MAE scores on almost all datasets, ranking at or near the top on ECL, ETT, Exchange, PEMS, Solar-Energy, Traffic, and Weather, demonstrating its broad applicability and effectiveness.

  • High-Dimensional Series: The model performs particularly well on datasets with a large number of variates, such as ECL (321 variates), Traffic (862 variates), and the PEMS subsets (170 to 883 variates). This supports the claim that iTransformer is effective at modeling multivariate correlations by applying attention on the variate dimension.

  • PatchTST Limitations: PatchTST, a previous state-of-the-art model, surprisingly fails in many cases on the PEMS dataset. The paper attributes this to the extremely fluctuating nature of PEMS series, suggesting that PatchTST's patching mechanism might lose focus on specific local details necessary to handle rapid fluctuations. In contrast, iTransformer, which aggregates whole series variations for series representations, copes better with such situations.

  • Crossformer vs. iTransformer: Despite Crossformer explicitly capturing cross-time and cross-variate dependencies, its performance is still subpar to iTransformer. This suggests that iTransformer's inverted architecture provides a more robust and less noisy way to handle multivariate interactions.

    In essence, the results strongly validate that iTransformer's inverted architecture provides a robust solution for real-world multivariate time series forecasting, outperforming both complex Transformer variants and simpler linear models.

6.2. iTransformers Generality

The paper further explores the generality of the inverted framework by applying it to various Transformer variants, evaluating variate generalization, and assessing performance with increased lookback length.

6.2.1. Performance Promotion

The inverted framework consistently improves the performance of various Transformer models.

The following are the results from Table 2 of the original paper:

Models Transformer (2017) Reformer (2020) Informer (2021) Flowformer (2022) Flashformer (2022)
MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
ECL Original 0.277 0.372 0.338 0.422 0.311 0.397 0.267 0.359 0.285 0.377
+Inverted 0.178 0.270 0.208 0.301 0.216 0.311 0.210 0.293 0.206 0.291
Promotion 35.6% 27.4% 38.4% 28.7% 30.5% 21.6% 21.3% 18.6% 27.8% 22.9%
Traffic Original 0.665 0.363 0.741 0.422 0.764 0.416 0.750 0.421 0.658 0.356
+Inverted 0.428 0.282 0.647 0.370 0.662 0.380 0.524 0.355 0.492 0.333
Promotion 35.6% 22.3% 12.7% 12.3% 13.3% 8.6% 30.1% 15.6% 25.2% 6.4%
Weather Original 0.657 0.572 0.803 0.656 0.634 0.548 0.286 0.308 0.659 0.574
+Inverted 0.258 0.279 0.248 0.292 0.271 0.330 0.266 0.285 0.262 0.282
Promotion 60.2% 50.8% 69.2% 55.5% 57.3% 39.8% 7.2% 7.7% 60.2% 50.8%

  • The inverted framework leads to significant MSE reductions (promotions) across all tested Transformer variants: 38.9% on Transformer, 36.1% on Reformer, 28.5% on Informer, 16.8% on Flowformer, and 32.2% on Flashformer. This robust improvement highlights that the issue was not the Transformer components themselves, but their conventional application to time series.
  • The framework allows for the integration of efficient attention mechanisms (like Flowformer or FlashAttention) to address computational complexity when the number of variates is large, making it practical for real-world high-dimensional datasets.

6.2.2. Variate Generalization

iTransformer demonstrates strong generalization capabilities on unseen variates.

The following figure (Figure 5 from the original paper) shows the performance of generalization on unseen variates:

Figure 5: Performance of generalization on unseen variates. The variates of each dataset are partitioned into five folds; models are trained with only 20% of the variates and then used to forecast all variates without fine-tuning. The chart compares the MSE of Transformer, Informer, Reformer, and Flowformer (with and without inversion) on ECL, Traffic, and Solar-Energy when trained on 100% versus 20% of the variates.

  • Flexibility: By design, iTransformer's variate tokens allow a flexible number of inputs, meaning it can be trained on a subset of variates and then forecast all variates (including unseen ones) without fine-tuning.
  • Transferable Representations: The shared feed-forward networks learn intrinsic patterns of time series that are transferable across different variates.
  • Comparison with CI-Transformers: When comparing iTransformers with Channel Independence (CI) Transformers (where a shared backbone forecasts each variate independently), iTransformers are shown to be more efficient. CI-Transformers often require predicting each variate one by one during inference, which is time-consuming. iTransformers can predict all variates simultaneously and show smaller increases in error when generalizing from partial training. This suggests a path toward building foundation models for time series based on iTransformer.
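
A small sketch of why this works mechanically, reusing the hypothetical ITransformerSketch from Section 4: none of the learned parameter shapes depend on the number of variates $N$, so the same weights accept a different $N$ at inference. Whether the resulting forecasts are accurate is, of course, the empirical question addressed by Figure 5.

```python
import torch

# Sketch only: the per-token design means the same weights handle any number of variates.
model = ITransformerSketch(lookback_len=96, pred_len=96)

x_train_subset = torch.randn(4, 96, 4)    # trained with, say, a 20% subset of variates (N=4)
x_all_variates = torch.randn(4, 96, 21)   # inference on all 21 variates, no fine-tuning

print(model(x_train_subset).shape)        # torch.Size([4, 96, 4])
print(model(x_all_variates).shape)        # torch.Size([4, 96, 21])
```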

6.2.3. Increasing Lookback Length

The inverted framework enables Transformer models to effectively utilize enlarged lookback windows for better predictions, a challenge for conventional Transformer adaptations.

The following figure (Figure 6 from the original paper) shows forecasting performance with increased lookback length:

Figure 6: Forecasting performance with lookback length $T \in \{48, 96, 192, 336, 720\}$ and fixed prediction length $S = 96$ on ECL and Traffic (y-axis: MSE). While the performance of Transformer-based forecasters does not necessarily improve with longer lookback windows, the inverted framework benefits clearly as the lookback window is enlarged.

  • Vanilla Transformer-based forecasters often do not benefit from an increased lookback length $T$, sometimes even degrading in performance, possibly due to distracted attention.
  • iTransformer and its inverted variants, however, show improved performance as the lookback length increases ($T \in \{48, 96, 192, 336, 720\}$). This indicates that the strategic assignment of the MLPs to the temporal dimension allows the model to leverage more historical information, a characteristic traditionally seen in strong statistical and linear methods.

6.3. Model Analysis

6.3.1. Ablation Study

Ablation studies confirm the rational assignment of Transformer components in iTransformer.

The following are the results from Table 3 of the original paper:

Design Variate Temporal ECL Traffic Weather Solar-Energy
MSE MAE MSE MAE MSE MAE MSE MAE
iTransformer Attention FFN 0.178 0.270 0.428 0.282 0.258 0.278 0.233 0.262
Replace Attention Attention 0.193 0.293 0.913 0.500 0.255 0.280 0.261 0.291
FFN Attention 0.202 0.300 0.863 0.499 0.258 0.283 0.285 0.317
FFN FFN 0.182 0.287 0.599 0.348 0.248 0.274 0.269 0.287
w/o Attention w/o FFN 0.189 0.278 0.456 0.306 0.261 0.281 0.258 0.289
Attention w/o FFN 0.193 0.276 0.461 0.294 0.265 0.283 0.261 0.283

(Note: The original Table 3 had some formatting issues, especially for the multi-column header and alignment. I have transcribed it as best as possible, focusing on data fidelity.)

The full results from Appendix B (Table 6) further elaborate on these findings.

The following are the results from Table 6 of the original paper:

Design (Variate / Temporal) | Prediction Length | ECL (MSE MAE) | Traffic (MSE MAE) | Weather (MSE MAE) | Solar-Energy (MSE MAE)
iTransformer (Attention / FFN) | 96 | 0.148 0.240 | 0.395 0.268 | 0.174 0.214 | 0.203 0.237
 | Avg | 0.178 0.270 | 0.428 0.282 | 0.258 0.279 | 0.233 0.262
Replace (Attention / Attention) | 96 | 0.161 0.263 | 1.021 0.581 | 0.168 0.213 | 0.227 0.270
 | Avg | 0.193 0.293 | 0.913 0.500 | 0.255 0.280 | 0.261 0.291
Replace (FFN / Attention) | 96 | 0.169 0.270 | 0.907 0.540 | 0.176 0.221 | 0.247 0.299
 | Avg | 0.202 0.300 | 0.863 0.499 | 0.258 0.283 | 0.285 0.317
Replace (FFN / FFN) | 96 | 0.159 0.261 | 0.606 0.342 | 0.162 0.207 | 0.237 0.277
 | Avg | 0.182 0.287 | 0.599 0.348 | 0.248 0.274 | 0.269 0.287
w/o Attention (FFN kept) | 96 | 0.163 0.254 | 0.427 0.296 | 0.177 0.219 | 0.226 0.266
 | Avg | 0.189 0.278 | 0.456 0.306 | 0.261 0.281 | 0.258 0.289
w/o FFN (Attention kept) | 96 | 0.169 0.253 | 0.437 0.283 | 0.183 0.220 | 0.228 0.263
 | Avg | 0.193 0.276 | 0.461 0.294 | 0.265 0.283 | 0.261 0.283

  • iTransformer (Attention on Variate, FFN on Temporal) consistently yields the best performance.
  • The designs that apply attention along the temporal dimension (Replace: Attention/Attention and FFN/Attention), which are closest to the conventional way Transformers are applied to temporal tokens, perform the worst on Traffic, highlighting the risks of the conventional architecture for time series.
  • Replacing Attention with FFN on the variate dimension (Replace: FFN on Variate, Attention on Temporal) leads to worse performance than iTransformer, especially on Traffic. This emphasizes the critical role of attention in capturing multivariate correlations.
  • Replacing FFN with Attention on the temporal dimension (Replace: Attention on Variate, Attention on Temporal) also shows degraded performance, suggesting FFN is better suited for temporal series representation learning.
  • Removing Attention (w/o Attention) or FFN (w/o FFN) both degrade performance, confirming that both components are essential and their specific roles in iTransformer are optimal.
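To make this component assignment concrete, below is a minimal PyTorch sketch of an inverted encoder block and forecaster; it is not the authors' implementation, and all class and parameter names (`InvertedBlock`, `InvertedForecaster`, `d_model`, etc.) are illustrative. Self-attention mixes the $N$ variate tokens to model multivariate correlations, while the feed-forward network and the embedding/projection layers act on each variate token individually.

```python
import torch
import torch.nn as nn


class InvertedBlock(nn.Module):
    """One inverted encoder block: attention over variate tokens, FFN per token."""

    def __init__(self, d_model: int, n_heads: int = 8, d_ff: int = 256, dropout: float = 0.1):
        super().__init__()
        # Self-attention mixes the N variate tokens to model multivariate correlations.
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        # Position-wise FFN learns a nonlinear representation for each variate token.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Dropout(dropout), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d_model); each token embeds one variate's whole lookback window.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))


class InvertedForecaster(nn.Module):
    """Minimal inverted forecaster: embed T points per variate, stack blocks, project to S."""

    def __init__(self, lookback: int, horizon: int, d_model: int = 128, n_blocks: int = 2):
        super().__init__()
        self.embed = nn.Linear(lookback, d_model)  # variate tokenization: T points -> 1 token
        self.blocks = nn.ModuleList([InvertedBlock(d_model) for _ in range(n_blocks)])
        self.head = nn.Linear(d_model, horizon)    # per-token projection back to S future points

    def forward(self, series: torch.Tensor) -> torch.Tensor:
        # series: (batch, T, N) -> (batch, N, T) so each variate becomes one token.
        tokens = self.embed(series.transpose(1, 2))
        for block in self.blocks:
            tokens = block(tokens)
        return self.head(tokens).transpose(1, 2)   # (batch, S, N)


# Example shapes: 96-step lookback, 96-step horizon, 7 variates.
model = InvertedForecaster(lookback=96, horizon=96)
print(model(torch.randn(8, 96, 7)).shape)  # torch.Size([8, 96, 7])
```

Note that, consistent with the inverted design, increasing the lookback length only widens the embedding layer rather than lengthening the attention sequence, which matches the lookback-window behavior discussed in Section 6.2.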

6.3.2. Analysis of Series Representations

The paper uses Centered Kernel Alignment (CKA) similarity to analyze the learned series representations. CKA measures the similarity between representations of different layers in a neural network. Higher CKA similarity between initial and final layer representations has been correlated with better performance in generative tasks like time series forecasting.
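For reference, a minimal NumPy sketch of linear CKA between two representation matrices (samples × features) is shown below; it follows the standard linear-kernel formulation and is not code from the paper.

```python
import numpy as np


def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA similarity between representations X (n, d1) and Y (n, d2)."""
    # Center the features so Gram matrices correspond to centered kernels.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Example: compare first- and last-layer token representations of the same batch.
first = np.random.randn(256, 128)
last = 0.8 * first + 0.2 * np.random.randn(256, 128)
print(round(linear_cka(first, last), 3))  # close to 1.0 for similar representations
```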

The following figure (Figure 7 from the original paper) shows an analysis of series representations and multivariate correlations:

Figure 7: Analysis of series representations and multivariate correlations. Left: comparison of MSE and CKA similarity of representations between Transformers and iTransformers, where a higher CKA similarity between first- and last-layer representations accompanies better forecasting. Right: visualization of the raw multivariate correlations of the series alongside the score maps learned by inverted self-attention.

  • As shown in Figure 7 (Left), iTransformers exhibit higher CKA similarity compared to vanilla Transformers. This indicates that iTransformers learn more appropriate and stable series representations, which contributes to their improved forecasting accuracy. This supports the idea that the inverted architecture helps maintain and refine meaningful representations throughout the network.

6.3.3. Analysis of Multivariate Correlations

By assigning the duty of multivariate correlation to the attention mechanism on variate tokens, iTransformer gains enhanced interpretability.

The following figure (Figure 11 from the original paper) shows multivariate correlations of the lookback series and future series and the learned score maps by inverted self-attention of different layers:

Figure 11: Multivariate correlations of the lookback series and the future series, together with the score maps learned by inverted self-attention at different layers (Cases 1–3). All cases come from the Solar-Energy dataset.

  • Interpretability: Figure 7 (Right) and Figure 11 show visualizations of attention score maps for Solar-Energy data. In shallow layers, the learned attention map closely resembles the actual Pearson correlation of the raw lookback series. As the model delves into deeper layers, the attention map progressively aligns more with the correlations of the future series to be predicted.

  • Encoding-Decoding Process: This phenomenon suggests that the iTransformer effectively uses attention to learn and evolve multivariate correlations from the past to predict the future. The encoding of past patterns and decoding for future predictions are implicitly conducted through series representations during the feed-forwarding and layer stacking.

    The following figure (Figure 12 from the original paper) shows a visualization of the variates from the Market dataset and the learned multivariate correlations:

    Figure 12: Visualization of variates from the Market dataset and the learned multivariate correlations. Each variate represents the monitored interface values of an application; the left panels show the time series and the right panels the corresponding correlation heat maps, with the variate pairs attended to by the model marked.

  • Figure 12, using the Market dataset, further demonstrates this. The learned multivariate correlation map clearly shows partitioning corresponding to application categories, and marked correlations reflect high similarity between variates from the same application. This confirms that iTransformer's attention mechanism can identify and leverage meaningful correlations between variates for forecasting.
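A minimal sketch of this kind of interpretability check is given below, assuming access to a layer's $N \times N$ attention score map (here replaced by a synthetic stand-in); the function and variable names are illustrative.

```python
import numpy as np


def pearson_correlation_map(series: np.ndarray) -> np.ndarray:
    """Pearson correlation between variates. series: (T, N) -> (N, N)."""
    return np.corrcoef(series.T)


def compare_attention_to_correlation(attn: np.ndarray, series: np.ndarray) -> float:
    """Correlate a learned (N, N) attention score map with the raw-series correlation map."""
    corr = pearson_correlation_map(series)
    # Compare only off-diagonal entries (self-attention to the same variate is uninformative).
    mask = ~np.eye(attn.shape[0], dtype=bool)
    return float(np.corrcoef(attn[mask], corr[mask])[0, 1])


# Illustrative usage with synthetic data standing in for Solar-Energy variates.
T, N = 96, 8
lookback = np.random.randn(T, N)
shallow_attn = np.abs(np.corrcoef(lookback.T)) + 0.05 * np.random.rand(N, N)  # stand-in scores
print(compare_attention_to_correlation(shallow_attn, lookback))
```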

6.3.4. Efficient Training Strategy

To address the quadratic complexity of self-attention with numerous variates, iTransformer proposes an efficient training strategy.

The following figure (Figure 8 from the original paper) shows an analysis of the efficient training strategy:

Figure 8: Analysis of the efficient training strategy. Left: forecasting performance (MSE) remains stable when only a sampled fraction of the variates is trained in each batch, across different sampling ratios. Right: the memory footprint drops substantially as the sampling ratio decreases; the inset zooms in on the Traffic curve.

  • Partial Variate Training: The strategy involves randomly choosing only a part of the variates in each batch for training. Due to the flexibility of variate tokenization, the model can still predict all variates during inference.
  • Performance vs. Memory: Figure 8 illustrates that this strategy maintains comparable performance (left chart) while significantly reducing the memory footprint (right chart). This makes iTransformer more scalable for extremely high-dimensional datasets common in real-world applications.
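A minimal sketch of the sampling idea follows, with illustrative names and under the assumption that lookback and target tensors have shape (batch, time, variates):

```python
import torch


def sample_variates(batch_x: torch.Tensor, batch_y: torch.Tensor, ratio: float = 0.25):
    """Randomly keep a fraction of variate columns for one training step.

    batch_x: (B, T, N) lookback window; batch_y: (B, S, N) target window.
    """
    n_vars = batch_x.shape[-1]
    keep = max(1, int(n_vars * ratio))
    idx = torch.randperm(n_vars)[:keep]
    return batch_x[..., idx], batch_y[..., idx]


# Inside a training loop (model, loader, criterion, optimizer are assumed to exist):
# for batch_x, batch_y in loader:
#     x_sub, y_sub = sample_variates(batch_x, batch_y, ratio=0.25)
#     loss = criterion(model(x_sub), y_sub)   # attention now runs over ~N/4 variate tokens
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```

Because each variate is tokenized independently, the same model can still be fed all variates at inference time, which is what makes this sampling strategy possible.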

6.4. Model Efficiency

The paper provides a comprehensive comparison of iTransformer's efficiency against other models.

The following figure (Figure 10 from the original paper) shows model efficiency comparison:

Figure 10: Model efficiency comparison under input-96-predict-96 on Weather (21 variates) and Traffic (862 variates), covering training speed, MSE, and memory footprint; iTransformer performs well across all three dimensions.

  • On datasets with a relatively small number of variates (e.g., Weather with 21 variates), iTransformer is more efficient than other Transformers.
  • On datasets with numerous variates (e.g., Traffic with 862 variates), iTransformer's memory footprint is similar to that of other Transformers rather than smaller: the attention module has $\mathcal{O}(N^2)$ complexity in the number of tokens $N$, and for iTransformer $N$ is the number of variates (862 for Traffic), whereas for vanilla Transformers $N = T$ is the lookback length (e.g., 96). Even so, iTransformer can be trained faster in practice; a back-of-the-envelope comparison of the two token counts follows after this list.
  • The use of linear-complexity attention mechanisms (like those in Flowformer) or the proposed efficient training strategy (partial variate training) can further bring iTransformer's speed and memory footprint in line with linear models, while still offering superior performance on many multivariate tasks.
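To make the token-count argument concrete, here is a back-of-the-envelope comparison of attention cost under the two tokenizations (pure arithmetic with an assumed hidden size, not measured numbers):

```python
def attention_mults(n_tokens: int, d_model: int) -> int:
    """Multiply count for QK^T and attention*V in one self-attention layer (projections ignored)."""
    return 2 * n_tokens * n_tokens * d_model


d_model = 128  # assumed hidden size for illustration
# Vanilla Transformer on Traffic: tokens are timestamps (lookback T = 96).
temporal = attention_mults(96, d_model)
# iTransformer on Traffic: tokens are variates (N = 862).
variate = attention_mults(862, d_model)
print(f"temporal-token attention ~{temporal:,} mults, variate-token attention ~{variate:,} mults")
# On Weather (N = 21) the comparison flips, matching the efficiency discussion above.
```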

6.5. Showcases

Visual prediction showcases further highlight iTransformer's superior performance.

The following figure (Figure 13 from the original paper) shows a visualization of input-96-predict-96 results on the Traffic dataset:

Figure 13: Visualization of input-96-predict-96 results on the Traffic dataset, comparing the forecasts of iTransformer, PatchTST, DLinear, Crossformer, Autoformer, and Transformer against the ground truth.

The following figure (Figure 14 from the original paper) shows a visualization of input-96-predict-96 results on the ECL dataset:

Figure 14: Visualization of input-96-predict-96 results on the ECL dataset, comparing the prediction curves of iTransformer, PatchTST, DLinear, Crossformer, Autoformer, and Transformer against the ground truth.

The following figure (Figure 15 from the original paper) shows a visualization of input-96-predict-96 results on the Weather dataset:

Figure 15: Visualization of input-96-predict-96 results on the Weather dataset, showing how closely the predictions of iTransformer, PatchTST, DLinear, Crossformer, Autoformer, and Transformer match the ground truth.

The following figure (Figure 16 from the original paper) shows a visualization of input-96-predict-96 results on the PEMS dataset:

Figure 16: Visualization of input-96-predict-96 results on the PEMS dataset, comparing the prediction curves of iTransformer, PatchTST, DLinear, Crossformer, Autoformer, and Transformer against the ground truth.

  • Figures 13-16 present visual comparisons of predictions from iTransformer and several baseline models (PatchTST, DLinear, Crossformer, Autoformer, Transformer) on Traffic, ECL, Weather, and PEMS datasets.
  • These visualizations consistently show that iTransformer produces the most precise future series variations, closely tracking the ground truth, compared to other models. This qualitative evidence reinforces the quantitative SOTA results.

6.6. Risks of Embedding Multivariate Points of A Timestamp

The paper provides a qualitative analysis of why the traditional temporal token embedding approach is problematic, especially with real-world complexities.

The following figure (Figure 17 from the original paper) shows a visualization of partial variates of Traffic:

Figure 17: Visualization of partial variates of the Traffic dataset (road occupancy over time for a subset of sensors). Several series exhibit strong synchronization (such as Sensor 2 and Sensor 4), while others show clear delays or advances relative to each other (such as Sensor 1 versus Sensor 2, and Sensor 859 versus Sensor 861).

  • Figure 17, from the Traffic dataset, illustrates scenarios where multivariate time series are strongly correlated yet exhibit obvious phase offsets or systematic time lags (e.g., Sensor 1 and Sensor 2).
  • In such cases, embedding simultaneous time points into a single temporal token (as done in vanilla Transformers) can lead to meaningless attention maps because these points do not represent the same event at the same time. This "interaction noise" degrades performance.
  • iTransformer mitigates this by embedding the whole series of each variate as a token. This makes it more robust to complexities like delayed events, inconsistent measurements, irregularly spaced time series, and systematic delays across variates.
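A small synthetic example of why fusing same-timestamp values can be misleading under a systematic lag (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(500)
sensor_a = np.sin(2 * np.pi * t / 50) + 0.1 * rng.standard_normal(t.size)
sensor_b = np.roll(sensor_a, 12)  # same signal, delayed by 12 steps

# Same-timestamp correlation (what a temporal token implicitly fuses) is weak...
print("simultaneous corr:", round(np.corrcoef(sensor_a, sensor_b)[0, 1], 2))
# ...while aligning for the lag recovers the strong relationship that attention over
# whole-series variate tokens can still exploit.
print("lag-aligned corr:", round(np.corrcoef(sensor_a[:-12], sensor_b[12:])[0, 1], 2))
```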

6.7. Data Presentation (Tables)

6.7.1. Full Promotion Results (Table 7)

The following are the results from Table 7 of the original paper:

| Model | ETT (MSE / MAE) | ECL (MSE / MAE) | PEMS (MSE / MAE) | Solar-Energy (MSE / MAE) | Traffic (MSE / MAE) | Weather (MSE / MAE) |
| --- | --- | --- | --- | --- | --- | --- |
| Transformer | 2.750 / 1.375 | 0.277 / 0.372 | 0.157 / 0.263 | 0.256 / 0.276 | 0.665 / 0.363 | 0.657 / 0.572 |
| iTransformer | 0.383 / 0.407 | 0.178 / 0.270 | 0.113 / 0.221 | 0.233 / 0.262 | 0.428 / 0.282 | 0.258 / 0.279 |
| Promotion | 86.1% / 70.4% | 35.6% / 27.4% | 28.0% / 16.0% | 9.0% / 5.1% | 35.6% / 22.3% | 60.2% / 50.8% |

6.7.2. Full Framework Generality Results (Table 8)

The following are the results from Table 8 of the original paper:

| Dataset | Variant | Length | Transformer (2017) | Reformer (2020) | Informer (2021) | Flowformer (2022) | Flashformer (2022) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ECL | Original | 96 | 0.260 / 0.358 | 0.312 / 0.402 | 0.274 / 0.368 | 0.215 / 0.320 | 0.259 / 0.357 |
| | | 192 | 0.266 / 0.367 | 0.348 / 0.433 | 0.296 / 0.386 | 0.259 / 0.355 | 0.274 / 0.374 |
| | | Avg | 0.277 / 0.372 | 0.338 / 0.422 | 0.311 / 0.397 | 0.267 / 0.359 | 0.285 / 0.377 |
| | +Inverted | 96 | 0.148 / 0.240 | 0.182 / 0.275 | 0.190 / 0.286 | 0.183 / 0.267 | 0.178 / 0.265 |
| | | Avg | 0.178 / 0.270 | 0.208 / 0.301 | 0.216 / 0.311 | 0.210 / 0.293 | 0.206 / 0.291 |
| Traffic | Original | 96 | 0.647 / 0.357 | 0.732 / 0.423 | 0.719 / 0.367 | 0.493 / 0.339 | 0.464 / 0.320 |
| | | 192 | 0.649 / 0.356 | 0.733 / 0.420 | 0.696 / 0.391 | 0.691 / 0.393 | 0.641 / 0.348 |
| | | Avg | 0.665 / 0.363 | 0.741 / 0.422 | 0.764 / 0.416 | 0.750 / 0.421 | 0.658 / 0.356 |
| | +Inverted | 96 | 0.395 / 0.268 | 0.617 / 0.356 | 0.632 / 0.367 | 0.493 / 0.339 | 0.464 / 0.320 |
| | | Avg | 0.428 / 0.282 | 0.647 / 0.370 | 0.662 / 0.380 | 0.524 / 0.355 | 0.492 / 0.333 |
| Weather | Original | 96 | 0.395 / 0.427 | 0.689 / 0.596 | 0.300 / 0.384 | 0.182 / 0.233 | 0.388 / 0.425 |
| | | 192 | 0.619 / 0.560 | 0.752 / 0.638 | 0.598 / 0.544 | 0.250 / 0.288 | 0.619 / 0.560 |
| | | Avg | 0.657 / 0.572 | 0.803 / 0.656 | 0.634 / 0.548 | 0.286 / 0.308 | 0.659 / 0.574 |
| | +Inverted | 96 | 0.174 / 0.214 | 0.169 / 0.225 | 0.180 / 0.251 | 0.183 / 0.223 | 0.177 / 0.218 |
| | | Avg | 0.258 / 0.279 | 0.248 / 0.292 | 0.271 / 0.330 | 0.266 / 0.285 | 0.262 / 0.282 |

(All cells are MSE / MAE; "Original" denotes the backbone with temporal tokens and "+Inverted" the same backbone with the inverted framework.)

6.7.3. Full Results of Variate Generalization (Table 18)

The following are the results from Table 18 of the original paper:

| Backbone | Model | ECL (MSE / MAE) | Traffic (MSE / MAE) | Solar-Energy (MSE / MAE) | PEMS03 (MSE / MAE) | PEMS04 (MSE / MAE) | PEMS08 (MSE / MAE) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Transformer | CI-Transformer | 0.366 / 0.435 | 0.720 / 0.428 | 0.301 / 0.336 | 0.170 / 0.279 | 0.164 / 0.278 | 0.168 / 0.277 |
| | | 1.238 / 0.825 | 1.082 / 0.771 | 0.490 / 0.485 | 0.260 / 0.363 | 0.260 / 0.360 | 0.275 / 0.380 |
| | iTransformer | 0.183 / 0.271 | 0.432 / 0.284 | 0.242 / 0.270 | 0.113 / 0.221 | 0.111 / 0.221 | 0.101 / 0.204 |
| | | 0.225 / 0.317 | 0.467 / 0.302 | 0.249 / 0.275 | 0.164 / 0.275 | 0.150 / 0.262 | 0.139 / 0.245 |
| Informer | CI-Informer | 0.354 / 0.430 | 0.767 / 0.431 | 0.309 / 0.342 | 0.178 / 0.288 | 0.175 / 0.287 | 0.172 / 0.282 |
| | | 1.282 / 0.835 | 1.077 / 0.749 | 0.485 / 0.482 | 0.270 / 0.366 | 0.268 / 0.363 | 0.280 / 0.378 |
| | iTransformer | 0.220 / 0.313 | 0.658 / 0.370 | 0.297 / 0.334 | 0.150 / 0.262 | 0.147 / 0.259 | 0.139 / 0.247 |
| | | 0.216 / 0.311 | 0.662 / 0.380 | 0.311 / 0.347 | 0.169 / 0.281 | 0.169 / 0.281 | 0.165 / 0.276 |
| Reformer | CI-Reformer | 0.351 / 0.428 | 0.749 / 0.426 | 0.312 / 0.344 | 0.178 / 0.288 | 0.175 / 0.286 | 0.173 / 0.283 |
| | | 1.258 / 0.834 | 1.073 / 0.748 | 0.490 / 0.485 | 0.268 / 0.366 | 0.265 / 0.362 | 0.280 / 0.378 |
| | iTransformer | 0.208 / 0.301 | 0.647 / 0.370 | 0.248 / 0.292 | 0.180 / 0.291 | 0.175 / 0.286 | 0.173 / 0.283 |
| | | 0.208 / 0.301 | 0.647 / 0.370 | 0.292 / 0.339 | 0.180 / 0.291 | 0.180 / 0.291 | 0.180 / 0.291 |
| Flowformer | CI-Flowformer | 0.288 / 0.373 | 0.709 / 0.419 | 0.275 / 0.315 | 0.162 / 0.273 | 0.158 / 0.270 | 0.159 / 0.267 |
| | | 1.082 / 0.758 | 1.020 / 0.718 | 0.432 / 0.443 | 0.252 / 0.354 | 0.249 / 0.350 | 0.260 / 0.360 |
| | iTransformer | 0.210 / 0.293 | 0.524 / 0.355 | 0.266 / 0.285 | 0.147 / 0.248 | 0.142 / 0.240 | 0.129 / 0.241 |
| | | 0.210 / 0.293 | 0.524 / 0.355 | 0.285 / 0.315 | 0.147 / 0.248 | 0.142 / 0.240 | 0.129 / 0.241 |

6.7.4. Full Forecasting Results - PEMS (Table 9)

The following are the results from Table 9 of the original paper:

| Dataset | iTransformer (Ours) | RLinear (2023) | PatchTST (2023) | Crossformer (2023) | TiDE (2023) | TimesNet (2023) | DLinear (2023) | SCINet (2022a) | FEDformer (2022) | Stationary (2022b) | Autoformer (2021) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PEMS03 | 0.113 / 0.221 | 0.180 / 0.291 | 0.169 / 0.281 | 0.326 / 0.419 | 0.147 / 0.248 | 0.278 / 0.375 | 0.114 / 0.224 | 0.213 / 0.327 | 0.147 / 0.249 | 0.667 / 0.601 | 0.742 / 0.627 |
| PEMS04 | 0.111 / 0.221 | 0.195 / 0.307 | 0.209 / 0.314 | 0.353 / 0.437 | 0.129 / 0.241 | 0.295 / 0.388 | 0.092 / 0.202 | 0.231 / 0.337 | 0.127 / 0.240 | 0.610 / 0.590 | 0.610 / 0.589 |
| PEMS07 | 0.101 / 0.204 | 0.211 / 0.303 | 0.235 / 0.315 | 0.441 / 0.464 | 0.158 / 0.244 | 0.280 / 0.358 | 0.100 / 0.210 | 0.201 / 0.296 | 0.119 / 0.225 | 0.504 / 0.478 | 0.817 / 0.645 |
| PEMS08 | 0.150 / 0.226 | 0.260 / 0.379 | 0.307 / 0.390 | 0.487 / 0.509 | 0.193 / 0.270 | 0.379 / 0.416 | 0.158 / 0.244 | 0.280 / 0.358 | 0.201 / 0.276 | 0.817 / 0.645 | 0.817 / 0.645 |

1st Count: 13 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 7 0 0 0 0 0 0

6.7.5. Full Forecasting Results - Long-term (Table 10)

The following are the results from Table 10 of the original paper:

| Dataset | iTransformer (Ours) | RLinear (2023) | PatchTST (2023) | Crossformer (2023) | TiDE (2023) | TimesNet (2023) | DLinear (2023) | SCINet (2022a) | FEDformer (2022) | Stationary (2022b) | Autoformer (2021) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ETTm1 | 0.288 / 0.332 | 0.286 / 0.327 | 0.281 / 0.326 | 0.757 / 0.610 | 0.358 / 0.404 | 0.291 / 0.333 | 0.350 / 0.400 | 0.571 / 0.537 | 0.305 / 0.349 | 0.306 / 0.347 | 0.327 / 0.370 |
| ETTh2 | 0.383 / 0.407 | 0.374 / 0.398 | 0.387 / 0.407 | 0.942 / 0.684 | 0.611 / 0.550 | 0.414 / 0.427 | 0.559 / 0.510 | 0.954 / 0.723 | 0.437 / 0.449 | 0.526 / 0.516 | 0.450 / 0.450 |
| ECL | 0.178 / 0.270 | 0.219 / 0.298 | 0.205 / 0.290 | 0.244 / 0.334 | 0.251 / 0.340 | 0.219 / 0.298 | 0.240 / 0.330 | 0.250 / 0.339 | 0.290 / 0.380 | 0.254 / 0.361 | 0.222 / 0.321 |
| Exchange | 0.360 / 0.403 | 0.378 / 0.417 | 0.367 / 0.404 | 0.940 / 0.707 | 0.370 / 0.417 | 0.354 / 0.414 | 0.358 / 0.404 | 0.780 / 0.619 | 0.461 / 0.454 | 0.610 / 0.570 | 0.804 / 0.637 |
| Traffic | 0.428 / 0.282 | 0.626 / 0.378 | 0.481 / 0.304 | 0.550 / 0.304 | 0.476 / 0.300 | 0.650 / 0.396 | 0.580 / 0.366 | 0.612 / 0.338 | 0.756 / 0.474 | 0.617 / 0.336 | 0.789 / 0.505 |
| Weather | 0.258 / 0.278 | 0.272 / 0.291 | 0.259 / 0.281 | 0.259 / 0.315 | 0.260 / 0.280 | 0.260 / 0.292 | 0.259 / 0.280 | 0.363 / 0.330 | 0.360 / 0.336 | 0.292 / 0.280 | 0.360 / 0.330 |
| Solar-Energy | 0.233 / 0.262 | 0.270 / 0.307 | 0.261 / 0.301 | 0.641 / 0.639 | 0.347 / 0.417 | 0.290 / 0.330 | 0.301 / 0.340 | 0.639 / 0.639 | 0.381 / 0.381 | 0.700 / 0.601 | 0.801 / 0.610 |

1st Count: 16 22 6 12 12 11 3 0 0 1 0 3 0 0 0 4 0 0 0 0

6.7.6. Full Forecasting Results - Market (Table 11)

The following are the results from Table 11 of the original paper:

| Dataset | iTransformer (Ours) | RLinear (2023) | PatchTST (2023) | Crossformer (2023) | TiDE (2023) | TimesNet (2023) | DLinear (2023) | SCINet (2022a) | FEDformer (2022) | Stationary (2022b) | Autoformer (2021) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Market-Merchant | 0.072 / 0.147 | 0.152 / 0.247 | 0.084 / 0.171 | 0.117 / 0.181 | 0.187 / 0.289 | 0.093 / 0.184 | 0.110 / 0.206 | 0.316 / 0.385 | 0.296 / 0.401 | 0.172 / 0.277 | 0.494 / 0.531 |
| Market-Wealth | 0.345 / 0.289 | 0.585 / 0.461 | 0.394 / 0.260 | 0.429 / 0.288 | 0.595 / 0.481 | 0.360 / 0.318 | 0.501 / 0.412 | 0.660 / 0.514 | 0.625 / 0.543 | 0.499 / 0.415 | 0.772 / 0.612 |
| Market-Finance | 0.184 / 0.216 | 0.395 / 0.336 | 0.210 / 0.248 | 5.333 / 0.618 | 0.987 / 0.442 | 0.516 / 0.308 | 0.765 / 0.372 | 2.817 / 0.734 | 1.621 / 0.569 | 1.368 / 0.643 | 1.872 / 0.681 |
| Market-Terminal | 0.065 / 0.150 | 0.180 / 0.286 | 0.077 / 0.179 | 0.071 / 0.162 | 0.216 / 0.311 | 0.080 / 0.179 | 0.106 / 0.210 | 0.280 / 0.360 | 0.295 / 0.403 | 0.180 / 0.296 | 0.518 / 0.547 |
| Market-Payment | 0.072 / 0.144 | 0.143 / 0.245 | 0.084 / 0.174 | 0.207 / 0.179 | 0.208 / 0.278 | 0.105 / 0.182 | 0.116 / 0.200 | 0.288 / 0.322 | 0.300 / 0.330 | 0.166 / 0.271 | 0.417 / 0.460 |
| Market-Customer | 0.094 / 0.150 | 0.214 / 0.261 | 0.118 / 0.180 | 0.309 / 0.194 | 0.308 / 0.307 | 0.142 / 0.191 | 0.184 / 0.219 | 0.461 / 0.385 | 0.350 / 0.391 | 0.242 / 0.301 | 0.669 / 0.593 |

1st Count: 28 27 0 0 0 0 0 0 0 3 0 0 20 0 0 0 0 0 0 0

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully introduces iTransformer, a novel framework for multivariate time series forecasting that re-conceptualizes the application of Transformer components without modifying their inherent design. By inverting the dimensions on which attention and feed-forward networks operate, iTransformer treats individual time series (variates) as tokens. This allows the self-attention mechanism to effectively capture multivariate correlations between different series, while the feed-forward network processes each variate token to learn robust series-global representations. The model achieves state-of-the-art performance across a diverse range of real-world datasets, demonstrating significant improvements over existing Transformer-based and linear forecasting models. Furthermore, iTransformer exhibits enhanced generalization ability to unseen variates and more effectively utilizes arbitrary lookback windows, addressing key limitations of prior Transformer applications in time series.

7.2. Limitations & Future Work

The authors acknowledge that while iTransformer provides a robust backbone, there is still room for improvement and exploration:

  • Large-scale Pre-training: The promising generalization abilities of iTransformer suggest its potential for large-scale pre-training on diverse multivariate time series datasets. This could lead to powerful foundation models for time series, similar to those in NLP and computer vision.
  • More Time Series Analysis Tasks: Future work can explore adapting iTransformer for other time series analysis tasks beyond forecasting, such as anomaly detection, imputation, or classification.
  • Enhanced Inductive Bias in Embedding: The current embedding layer is a simple MLP. The authors suggest that enhancing this embedding with more inductive bias (e.g., using TCN components) could make the model even more robust to various real-world scenarios (like irregular time series) while maintaining the flexibility of the Transformer architecture.
  • Specialized Attention and FFN: Although iTransformer uses native components, the authors hint that elaborately designed components (e.g., more efficient attention for multivariate correlation, temporal dependency modeling under distribution shifts, fine-grained variate tokenization) could be developed based on the inverted architecture to further boost performance.

7.3. Personal Insights & Critique

iTransformer offers a refreshingly simple yet profoundly impactful insight: the power of Transformers doesn't necessarily lie in complex modifications of its core mechanisms, but rather in a judicious re-evaluation of how inputs are tokenized and on which dimensions its components operate. This inverted perspective is a paradigm shift that could fundamentally change how Transformers are applied to multivariate time series.

Inspirations and Applications:

  • Foundation Models for Time Series: The demonstrated variate generalization is particularly exciting. It strongly suggests that iTransformer could be a cornerstone for developing pre-trained foundation models for time series. A single model, trained on a vast and diverse collection of multivariate time series (even with varying numbers of variates), could then be fine-tuned or directly applied to new, unseen series, much like BERT or GPT for text. This would be a significant leap for the field, currently hampered by the need for task-specific models.
  • Robustness to Real-world Data: The paper effectively addresses common real-world complexities like time lags, inconsistent measurements, and non-stationarity. By normalizing each variate token and using attention for multivariate correlations, iTransformer provides a more robust solution compared to models that assume perfect synchronization or homogeneity between variates.
  • Unifying Linear and Transformer Strengths: The design cleverly integrates the observed strengths of linear models (effective temporal modeling) into the FFN applied to variate tokens, while retaining the attention mechanism for multivariate interaction. This hybrid approach potentially offers the best of both worlds.

Potential Issues or Areas for Improvement:

  • Computational Cost for Very High Variate Count: While the paper suggests using efficient attention mechanisms or partial variate training to mitigate the $\mathcal{O}(N^2)$ complexity of attention for a large number of variates $N$, this quadratic dependency on $N$ remains a theoretical bottleneck for truly massive multivariate datasets (e.g., $N > 10000$) if dense attention is used. Exploring alternative sparse attention strategies intrinsically designed for variate-wise interaction could be beneficial.

  • Implicit Temporal Ordering: The paper states that positional encoding is not needed because temporal order is implicitly stored within the variate token (as it's an embedding of the whole series). While this might be true for the current MLP embedding, the explicit temporal position might still be valuable for some highly temporal-sensitive tasks or for longer lookback windows where fine-grained temporal information could be lost in a compressed variate token. Further investigation into temporal encoding within the variate token could be an area of study.

  • Understanding Variate Token Semantics: While the attention maps are shown to be interpretable for multivariate correlations, a deeper dive into the exact semantic content learned by the MLP embedding for each variate token could provide more insights. What "intrinsic properties" (amplitude, periodicity, frequency spectrums) are truly being captured, and can these be explicitly controlled or influenced?

Overall, iTransformer represents a strong argument for re-examining fundamental architectural choices in applying powerful deep learning models to specific data modalities. Its success on diverse benchmarks and its potential for fostering foundation models make it a highly significant contribution to time series forecasting.
