
Continual Learning for Time Series Forecasting: A First Survey

Published:07/17/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This survey reviews continual learning methods for time series forecasting, addressing data and concept drift, and balancing plasticity and stability to adapt models in dynamic, nonstationary environments.

Abstract

HAL Id: hal-04836655 (https://hal.science/hal-04836655v1), submitted on 13 Dec 2024. Distributed under a Creative Commons Attribution 4.0 International License.

To cite this version: Quentin Besnard, Nicolas Ragot. Continual Learning for Time Series Forecasting: A First Survey. ITISE 2024, Jul 2023, Gran Canaria, Spain. pp.49, 10.3390/engproc2024068049. hal-04836655


1. Bibliographic Information

1.1. Title

The paper is titled "Continual Learning for Time Series Forecasting: A First Survey". As the title indicates, it offers an initial overview and review of continual learning techniques applied specifically to time series forecasting.

1.2. Authors

The authors of the paper are Quentin Besnard and Nicolas Ragot.

  • Quentin Besnard is affiliated with Université de Tours, Laboratoire d'Informatique Fondamentale et Appliquée de Tours (LIFAT), 37200 Tours, France. His email is quentin.besnard@univ-tours.fr.
  • Nicolas Ragot is also affiliated with Université de Tours, Laboratoire d'Informatique Fondamentale et Appliquée de Tours (LIFAT), 37200 Tours, France. His email is nicolas.ragot@univ-tours.fr.

1.3. Journal/Conference

The paper was published in Eng. Proc. 2024, 68, 49. It was presented at the 10th International Conference on Time Series and Forecasting (ITISE 2024), held in Gran Canaria, Spain, from 15 to 17 July 2024. Eng. Proc. (Engineering Proceedings) is a proceedings journal that publishes papers from engineering conferences. Although it is not a top-tier journal, conference proceedings are a common way to disseminate new research, especially in rapidly evolving fields. ITISE specializes in time series and forecasting, which makes it a relevant venue for this survey.

1.4. Publication Year

The paper was published on July 17, 2024. The HAL record lists a submission date of December 13, 2024; this is the date on which the paper was deposited in the HAL archive, which is why it falls after the publication date.

1.5. Abstract

Deep learning has achieved significant progress across various fields like robotics and image processing. However, a major challenge for neural networks is their high demand for quantitative and stationary data for proper model computation. Many real-life applications, especially in dynamic environments, cannot meet these constraints due to data and concept drift, where data distribution or goals change over time. Continual learning (CL) aims to address these issues by developing evolving models that can adapt over time, navigating the plasticity/stability dilemma while considering resource limitations. While CL efforts are seen across deep learning applications, its application to time series, particularly in regression and forecasting, is still limited. This paper serves as a first survey of continual learning applied to time series forecasting.

The source link provided with this analysis is /files/papers/690053ceed47de95d44a33ed/paper.pdf, which appears to be an internal file path. The HAL record is at https://hal.science/hal-04836655v1, and the DOI link is https://doi.org/10.3390/engproc2024068049. Together these indicate that the paper is an officially published work in conference proceedings.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to address is the limitation of traditional deep learning models in dynamic, real-world environments where data patterns and distributions are constantly changing. Standard neural networks require large amounts of stationary data (data whose statistical properties do not change over time) for effective training. However, in many practical scenarios, data drift (changes in data distribution) and concept drift (changes in the underlying relationship between input and output variables) are common. This non-stationarity makes pre-trained, static models quickly obsolete or inaccurate, necessitating frequent and costly retraining from scratch, which is often infeasible due to computational and data storage constraints.

This problem is important because artificial intelligence (AI) applications are increasingly deployed in nonstationary environments, such as energy management, climate prediction, and traffic analysis. The ability of AI models to adapt continuously to these evolving conditions is crucial for their long-term effectiveness and reliability. Existing research in continual learning (CL) has primarily focused on classification tasks (e.g., image recognition) and hasn't adequately addressed the unique challenges of time series forecasting where the output is a continuous value (regression) and the temporal dependency is critical.

The paper's entry point is to provide the "first survey" on continual learning specifically for time series forecasting. By doing so, it aims to consolidate the scattered knowledge in this nascent field, highlight existing approaches, identify gaps, and propose future research directions, thereby catalyzing further development.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • First Survey: It provides the first comprehensive review of continual learning techniques applied to time series forecasting, compiling existing works and concepts in this specific domain.

  • Conceptual Clarification: It clarifies key continual learning principles, including the plasticity/stability dilemma and catastrophic forgetting, and categorizes the main strategies to address them (rehearsal, regularization, architectural modification).

  • Domain-Specific Analysis: It analyzes how continual learning scenarios (e.g., task incremental, domain incremental) from classification translate (or don't) to time series forecasting, introducing data domain incremental learning and target domain incremental learning.

  • Identification of Trends and Challenges: It identifies current research trends, such as combining CL strategies and biologically inspired models, and highlights significant challenges, including the lack of common benchmarks, the need for task-free solutions, and the development of new metrics for forgetting rate, plasticity/stability, and resource consumption.

  • Application Scope: It reviews the main application domains where CL for time series forecasting is being explored (primarily energy, industrial maintenance, climate, traffic analysis) and emphasizes its potential impact on broader fields like medicine and environmental studies.

The key conclusions and findings include:

  • Continual learning models demonstrate superior performance compared to fine-tuning and often approach the performance of joint training (an upper bound) in nonstationary environments while effectively mitigating catastrophic forgetting.

  • Despite promising results, the field of CL for time series forecasting is nascent and lacks standardization, particularly in terms of common benchmarks and task definition.

  • Existing CL strategies, largely adapted from classification, need further refinement and novel development to fully address the specific characteristics of time series data and regression tasks.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a foundational understanding of several key concepts is essential, particularly for a beginner.

  • Deep Learning (DL): A subfield of machine learning that uses artificial neural networks with multiple layers (hence "deep") to learn complex patterns from data. Unlike traditional machine learning algorithms, deep learning models can automatically extract features from raw data, reducing the need for manual feature engineering. It has driven significant advancements in AI in areas like image recognition, natural language processing, and robotics.

  • Neural Networks (NNs): Computational models inspired by the structure and function of biological neural networks. They consist of interconnected nodes (neurons) organized in layers. Each connection has a weight associated with it, and neurons apply an activation function to the weighted sum of their inputs to produce an output. Through a process called training, the weights are adjusted to minimize the difference between the network's predictions and the actual target values.

  • Time Series: A sequence of data points indexed in time order. Examples include stock prices, weather measurements, energy consumption, and sensor readings. The key characteristic is the inherent temporal dependency, meaning that past observations can influence future ones.

  • Time Series Forecasting: The process of predicting future values of a time series based on its past values and potentially other related factors. It's a type of regression task where the output is a continuous number. Traditional methods include ARIMA, Exponential Smoothing, and Prophet, while deep learning models like Recurrent Neural Networks (RNNs), LSTMs, GRUs, and Transformers are increasingly used.

  • Regression: A machine learning task where the goal is to predict a continuous numerical value. For example, predicting house prices, temperature, or energy demand. This contrasts with classification.

  • Classification: A machine learning task where the goal is to predict a categorical label or class. For example, classifying an image as a "cat" or "dog," or determining if an email is "spam" or "not spam."

  • Nonstationary Environments: Environments where the underlying statistical properties of the data change over time. This means that patterns learned from past data may not hold true for future data, making models trained on historical data less effective over time.

  • Data Drift: A phenomenon where the statistical properties of the input data change over time. For example, in a sensor network, sensor readings might gradually shift due to environmental factors or sensor degradation. The relationship between input and output might remain the same, but the input distribution changes.

  • Concept Drift: A phenomenon where the underlying relationship between the input variables and the target variable changes over time. For example, consumer preferences for a product might shift, meaning that the features that predicted sales in the past no longer do so effectively. This is a more profound change than data drift.

  • Continual Learning (CL) (also known as lifelong learning, incremental learning, or progressive learning): A subfield of machine learning focused on enabling models to continuously learn new tasks or new information from a stream of data without forgetting previously acquired knowledge. The goal is to build models that can adapt and evolve over their lifetime.

  • Catastrophic Forgetting: A major challenge in continual learning where a neural network, upon learning a new task or new data, rapidly and severely forgets knowledge acquired from previous tasks or data. This happens because updating the network's weights for new information often overwrites the weights critical for old information.

  • Plasticity/Stability Dilemma: A fundamental trade-off in continual learning. Plasticity refers to a model's ability to quickly adapt and learn new information. Stability refers to its ability to retain previously learned knowledge. A highly plastic model might be prone to catastrophic forgetting, while a very stable model might struggle to learn new tasks. Continual learning aims to find a balance between these two conflicting objectives.
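
The difference between data drift and concept drift defined above can be made concrete with a small synthetic sketch. Everything in it is an illustrative assumption rather than material from the paper: the helper `make_stream`, the Gaussian inputs, the linear relation, and the specific slopes and means. A line is fitted once on a reference period and then scored, without retraining, on a period with data drift and on a period with concept drift.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_stream(n, x_mean, slope):
    """Draw inputs x ~ N(x_mean, 1) and targets y = slope * x + small noise."""
    x = rng.normal(loc=x_mean, scale=1.0, size=n)
    y = slope * x + rng.normal(scale=0.1, size=n)
    return x, y

# Reference period: the data the static model is fitted on.
x_ref, y_ref = make_stream(2000, x_mean=0.0, slope=2.0)
# Data drift: the input distribution moves, the input/output relation does not.
x_dd, y_dd = make_stream(2000, x_mean=3.0, slope=2.0)
# Concept drift: same input distribution, but the relation itself changes.
x_cd, y_cd = make_stream(2000, x_mean=0.0, slope=-1.0)

# Static model: a least-squares line fitted once and never updated.
slope_hat, intercept_hat = np.polyfit(x_ref, y_ref, deg=1)

def mse(x, y):
    """Mean squared error of the static model on (x, y)."""
    return float(np.mean((y - (slope_hat * x + intercept_hat)) ** 2))

mse_ref, mse_dd, mse_cd = mse(x_ref, y_ref), mse(x_dd, y_dd), mse(x_cd, y_cd)
input_shift_dd = abs(x_dd.mean() - x_ref.mean())
input_shift_cd = abs(x_cd.mean() - x_ref.mean())

print(f"data drift:    input shift={input_shift_dd:.2f}, MSE={mse_dd:.3f}")
print(f"concept drift: input shift={input_shift_cd:.2f}, MSE={mse_cd:.3f}")
```

In this toy setting the static model survives data drift only because a linear model happens to match the true relation everywhere; with a misspecified model, a shift in the inputs alone can also degrade accuracy. Under concept drift the inputs look unchanged, yet the error grows sharply because the learned relation no longer holds.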

3.2. Previous Works

The paper frames its discussion around existing continual learning strategies designed to combat catastrophic forgetting, primarily developed for classification tasks, and then explores their application to time series forecasting.

The three main categories of strategies for continual learning are:

  1. Rehearsal (or Data Replay) Strategies:

    • Conceptual Definition: These methods aim to prevent catastrophic forgetting by periodically re-exposing the model to a small subset of previously seen data (or synthetic data resembling it) while it learns new information. The idea is that by rehearsing past knowledge, the model reinforces the weights associated with it and prevents them from being completely overwritten by new learning.
    • Mechanism: During the training process for a new task or data batch, a small buffer of old data samples (or distilled knowledge from them) is mixed with the new data, and the model is trained on this combined set.
    • Example (Conceptual): Imagine a student learning topic A, then topic B. To prevent forgetting A, the student periodically reviews notes from topic A while studying topic B.
    • Challenges: The main challenge is managing the rehearsal buffer: deciding which past samples to store, how many, and how often to replay them. A larger buffer leads to better retention but higher memory and computational costs. Rehearsal-free versions try to generate synthetic data or use other means to simulate replay without storing actual past samples.
    • Cited Works: [13, 25, 27, 29]
  2. Regularization-based Strategies:

    • Conceptual Definition: These methods add a penalty term to the model's loss function during training. This penalty discourages significant changes to weights that are deemed important for previously learned tasks. The goal is to make the learning process for new tasks path-constrained, preserving the knowledge acquired from old tasks.
    • Mechanism: When learning a new task, the loss function is modified to include a term that measures how much the current weight updates deviate from the weights that were optimal for previous tasks. Weights that are more "important" for past knowledge are penalized more heavily if they change significantly.
    • Example: Elastic Weight Consolidation (EWC) [12, 33, 34]:
      • Conceptual Definition: EWC is a prominent regularization technique. It identifies which weights in the neural network are most crucial for previously learned tasks and then selectively slows down their learning rate when a new task is encountered. This is analogous to a human memory system where certain memories are "hardened" and resistant to change.
      • Mathematical Formula: After task $A$ has been learned, the loss used when learning a new task $B$ is modified as: $ L(W) = L_B(W) + \sum_i \frac{\lambda}{2} F_i (\theta_i - \theta_{A,i}^*)^2 $
      • Symbol Explanation:
        • $L(W)$: The total loss function for the current training step, considering both the new task and preventing forgetting.
        • $L_B(W)$: The standard loss function for the new task $B$ (e.g., cross-entropy for classification, MSE for regression) using the current network weights $W$.
        • $\sum_i$: Summation over all individual weights $i$ in the network.
        • $\lambda$: A hyperparameter that controls the strength of the regularization term. A higher $\lambda$ means more emphasis on protecting old knowledge.
        • $F_i$: The Fisher Information Matrix diagonal entry for weight $i$. The Fisher Information Matrix estimates the importance of each weight for the previous task $A$. A higher $F_i$ means weight $i$ is more critical for task $A$.
        • $\theta_i$: The current value of weight $i$ (the $i$-th component of $W$).
        • $\theta_{A,i}^*$: The value of weight $i$ after learning task $A$ (i.e., the optimal weight for task $A$).
      • Purpose: The regularization term $(\theta_i - \theta_{A,i}^*)^2$ penalizes the squared difference between the current weight and its optimal value for task $A$. This penalty is scaled by $F_i$ and $\lambda$, effectively creating "elastic" constraints around important weights.
    • Challenges: Calculating and storing the Fisher Information Matrix can be computationally expensive and memory-intensive for large networks and many tasks. Also, if tasks are too disjoint, regularization might lead to an "average" model that performs poorly on all tasks.
    • Cited Works: [12, 33, 34] (for EWC), [15] (for Synaptic Intelligence, another regularization method).
  3. Architectural Modification (or Model Extension) Strategies:

    • Conceptual Definition: These methods address catastrophic forgetting by dynamically altering the network's architecture (e.g., adding or removing neurons/layers) as new tasks are learned. Instead of modifying existing weights for old tasks, new parts of the network are dedicated to new tasks, preserving the old parts.

    • Mechanism: When a new task arrives, the model's capacity is increased by adding new neurons, layers, or even entire subnetworks. The parts of the network responsible for previous tasks are "frozen" or isolated from modification.

    • Example (Conceptual): Imagine building new rooms onto a house for new purposes, rather than constantly re-purposing existing rooms and potentially damaging their original function.

    • Challenges: This approach leads to a continuously growing model size, which can become computationally and memory-prohibitive. Techniques like pruning (removing unnecessary connections or neurons) can help mitigate this but are complex to implement without performance degradation.

    • Cited Works: [30, 31, 32]

These three main strategies can also be combined into hybrid versions to leverage their respective strengths, offering trade-offs between performance, memory, and computational cost [1, 17, 19, 35–37].
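
The EWC penalty described under the regularization strategy can be checked numerically with a minimal sketch. It assumes NumPy; the function names `empirical_fisher_diag`, `ewc_penalty`, and `ewc_loss` and the example numbers are hypothetical, and estimating the diagonal Fisher information as the mean of squared per-sample gradients is a common approximation used here as an assumption, not a detail taken from the survey.

```python
import numpy as np

def empirical_fisher_diag(per_sample_grads):
    """Assumed approximation: diagonal Fisher as mean squared per-sample gradient.

    per_sample_grads has shape (n_samples, n_weights).
    """
    grads = np.asarray(per_sample_grads, dtype=float)
    return np.mean(grads ** 2, axis=0)

def ewc_penalty(theta, theta_star, fisher_diag, lam):
    """Quadratic EWC term: sum_i (lam / 2) * F_i * (theta_i - theta*_i)^2."""
    theta = np.asarray(theta, dtype=float)
    theta_star = np.asarray(theta_star, dtype=float)
    return float(0.5 * lam * np.sum(fisher_diag * (theta - theta_star) ** 2))

def ewc_loss(new_task_loss, theta, theta_star, fisher_diag, lam):
    """Total loss: new-task loss plus the penalty protecting old-task weights."""
    return new_task_loss + ewc_penalty(theta, theta_star, fisher_diag, lam)

# Weights after task A, with weight 0 far more important than weight 1.
theta_star = np.array([1.0, -2.0])
fisher = np.array([10.0, 0.1])

# Moving the important weight by 0.5 costs much more than moving the other one.
pen_important = ewc_penalty(theta_star + np.array([0.5, 0.0]), theta_star, fisher, lam=1.0)
pen_unimportant = ewc_penalty(theta_star + np.array([0.0, 0.5]), theta_star, fisher, lam=1.0)
print(pen_important, pen_unimportant)
```

The same displacement is penalized 100 times more on the weight with the larger Fisher value, which is the "elastic" behaviour the formula is meant to produce.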

3.3. Technological Evolution

The evolution of deep learning and continual learning has followed a path of increasing complexity and specialization:

  • Early Deep Learning: Focused on static datasets, achieving breakthroughs in image recognition (CNNs) and natural language processing (RNNs, LSTMs). These models excelled when the training and test data shared the same distribution.
  • Emergence of Continual Learning: As deep learning models moved into real-world applications, the problem of catastrophic forgetting in dynamic environments became apparent. This spurred research into continual learning, initially heavily focused on classification tasks, driven by standardized benchmarks like MNIST and CIFAR.
  • Development of CL Strategies: The three main categories (rehearsal, regularization, architectural) emerged as solutions, each with its own advantages and disadvantages, constantly being refined and combined.
  • Application to Time Series: More recently, with the increasing use of deep learning for time series forecasting, the challenges of nonstationary environments and concept drift have highlighted the need for continual learning in this domain. This paper sits at this juncture: it consolidates these early exploratory efforts and serves as a foundational review for a relatively new subfield.

3.4. Differentiation Analysis

Compared to the main methods in related work, this paper differentiates itself not by proposing a new continual learning algorithm or model, but by being the first dedicated survey focusing specifically on the intersection of continual learning and time series forecasting.

  • Focus on Time Series Forecasting: While continual learning research is abundant, the vast majority of it targets classification tasks (e.g., image classification). This paper explicitly shifts the focus to regression tasks on time series data, highlighting the unique challenges and opportunities in this domain (e.g., specific types of drift, lack of task boundaries, temporal dependencies).
  • Survey Nature: Instead of introducing a novel algorithm, the paper's innovation lies in its comprehensive review, categorization, and critical analysis of existing CL techniques and their nascent applications in time series forecasting. It acts as a guide, consolidating scattered knowledge, identifying key concepts, outlining current trends, and pinpointing significant research gaps (like the absence of common benchmarks).
  • Identification of Specific Scenarios: It introduces and clarifies data domain incremental learning and target domain incremental learning as the primary continual learning scenarios relevant to time series forecasting, distinguishing them from the task incremental learning often seen in classification.
  • Highlighting Gaps: A crucial differentiation is its emphasis on the lack of task-free solutions and common benchmarks, which are more prevalent in classification CL. This sets the agenda for future research, urging the community to develop solutions tailored to time series.

4. Methodology

As a survey paper, this work does not propose a novel algorithm or model in the traditional sense. Instead, its methodology revolves around structuring and analyzing the existing literature on continual learning within the context of time series forecasting. The authors' approach involves:

  1. Defining Core Principles: Establishing a common understanding of continual learning concepts.
  2. Categorizing Strategies: Grouping existing continual learning methods based on their underlying mechanisms.
  3. Analyzing Domain-Specific Challenges: Adapting continual learning scenarios from classification to time series and identifying unique problems.
  4. Reviewing Applications and Results: Summarizing findings from applied works in time series forecasting.
  5. Identifying Trends and Future Work: Outlining open problems and promising research directions.

4.1. Principles

The core idea guiding the survey is the fundamental challenge of catastrophic forgetting and the plasticity/stability dilemma in deep learning models operating in nonstationary environments. The theoretical basis is that standard models, trained on fixed datasets, cannot adapt to evolving data distributions or goals without losing previously acquired knowledge. Continual learning aims to solve this by enabling models to continuously learn and adapt over time.

  • Stability: A model exhibits stability if it can accurately predict outcomes when the data distribution, patterns, or environment remain constant. This is ideal for fixed, perfectly sampled datasets but fails when changes occur.
  • Plasticity: A model exhibits plasticity if it can continuously and quickly adapt to new tasks and data. While crucial for adaptation, excessive plasticity can lead to catastrophic forgetting as new learning overwrites old knowledge.
  • Compromise: Continual learning seeks to find a compromise between plasticity and stability, allowing the model to adapt to new information without forgetting old information, ideally within given material and computational constraints.

4.2. Core Methodology In-depth (Survey Structure and Analysis Framework)

The survey's methodological framework involves analyzing existing continual learning literature through the lens of its three main strategies, then examining how these apply to time series forecasting and the specific challenges encountered.

4.2.1. Strategies to Address Catastrophic Forgetting

The paper outlines three primary strategies to address catastrophic forgetting, which form the basis for classifying continual learning approaches:

  1. Data Rehearsal (Replay):

    • Principle: Constantly reintroducing samples of past data into the learning steps to adapt the model. This stimulates the model with past knowledge to prevent its degradation.
    • Mechanism: A memory buffer stores a selection of past data. When learning new data, the model is trained on a mix of new data and samples from this memory buffer.
    • Advantages: Often yields the best results in terms of model accuracy for retaining old knowledge.
    • Drawbacks: Leads to a perpetual increase in the amount of data the model must manage over time, increasing memory and computational costs. Rehearsal-free versions are being explored to mitigate this.
    • Research Directions: Management of data to be reintroduced (e.g., selection criteria for the buffer).
  2. Architectural Modification (Model Extension):

    • Principle: Adapting the model's architecture directly as new tasks or data are encountered. This avoids modifying parts of the network that are used for previous tasks.
    • Mechanism: The size of network layers or the number of components grows over time. New tasks are learned by adding new neurons, layers, or subnetworks, while existing weights corresponding to past tasks are often frozen.
    • Advantages: No loss in performance on previously learned tasks because weights are preserved.
    • Drawbacks: Continuous growth in model size, leading to increased memory footprint and computational requirements. Pruning mechanisms can address this but are not well-established for continual learning.
  3. Regularization Approach:

    • Principle: Penalizing the modification of neural network weights based on their estimated importance for previously learned tasks. The goal is to constrain weight updates to retain prior knowledge.
    • Mechanism: An additional term is added to the loss function during training. This term quantifies the importance of each weight for past tasks (e.g., using Fisher Information) and penalizes changes to important weights.
    • Example (EWC, as explained in Section 3.2): $ L(W) = L_{\text{new\_task}}(W) + \sum_i \frac{\lambda}{2} F_i (\theta_i - \theta_{\text{old\_task},i}^*)^2 $ where $L_{\text{new\_task}}(W)$ is the loss for the new task, $\lambda$ is the regularization strength, $F_i$ is the Fisher information for weight $i$, $\theta_i$ is the current weight value, and $\theta_{\text{old\_task},i}^*$ is the optimal weight value for previous tasks.
    • Advantages: Does not imply network growth or data replay, making it potentially more scalable in the long run.
    • Drawbacks: Can still exhibit some forgetting if tasks are too disjoint or the regularization is not perfectly tuned. It might lead to a model that is an "average" performer across many diverse tasks.
    • Hybrid Versions: The paper notes that these three strategies can be combined (e.g., regularization with data replay) to achieve better performance by leveraging their respective strengths.
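
The data rehearsal mechanism (a memory buffer whose samples are mixed with new data) can be sketched as follows. The survey leaves buffer management as an open research direction, so the choices here are assumptions for illustration only: the class name `ReplayBuffer`, the fixed capacity, and reservoir sampling as the selection rule.

```python
import random

class ReplayBuffer:
    """Fixed-size memory of past samples, filled by reservoir sampling (assumed rule)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.n_seen = 0  # total number of samples offered to the buffer
        self._rng = random.Random(seed)

    def add(self, sample):
        """Offer one sample; every sample seen so far is kept with equal probability."""
        self.n_seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            j = self._rng.randrange(self.n_seen)
            if j < self.capacity:
                self.samples[j] = sample

    def mixed_batch(self, new_batch, n_replay):
        """Return the new samples followed by up to n_replay stored old samples."""
        k = min(n_replay, len(self.samples))
        return list(new_batch) + self._rng.sample(self.samples, k)

# Stream 1000 old samples through a 50-slot buffer, then build one training batch.
buffer = ReplayBuffer(capacity=50)
for t in range(1000):
    buffer.add(("old", t))

batch = buffer.mixed_batch([("new", t) for t in range(8)], n_replay=4)
print(len(buffer.samples), len(batch))
```

Capping the buffer bounds the memory cost noted in the drawbacks above, at the price of keeping only a sample of the past; which samples are worth keeping is exactly the management question the survey flags.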

4.2.2. Continual Learning Scenarios for Time Series Forecasting

The paper highlights that continual learning research in classification typically focuses on three scenarios:

  • Task Incremental Learning: Sequentially learning to solve distinct tasks, where a task is defined by a goal within a context.

  • Class Incremental Learning: Discriminating between incrementally observed classes.

  • Domain Incremental Learning: Learning to solve the same problem in different contexts (e.g., recognizing the same objects in different lighting conditions).

For time series forecasting, the paper points out that the task incremental learning scenario (as defined for classification) is generally not addressed. Instead, two main scenarios are identified, as outlined in [38]:

  1. Incremental Learning of the Data Domain:
    • Principle: The underlying data generation process changes over time due to the nonstationarity of the data stream. The distribution of the data relative to the same objective varies.
    • Example: Predicting energy consumption, where consumption patterns shift due to seasonal changes or new appliance usage over time, but the goal (predicting consumption) remains the same. This is equivalent to data drift or concept drift where the prediction target remains constant.
  2. Incremental Learning of the Target Domain:
    • Principle: The output of the model varies over time, meaning the number or properties of prediction targets change.
    • Example: A model initially predicting single-day-ahead temperature might later need to predict multi-day-ahead temperatures, or new variables might need to be predicted (e.g., adding humidity alongside temperature). This involves changes in the prediction horizon or the output space.
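
A change of prediction horizon, as in the target domain example above, shows up directly in the shape of the training targets. The sketch below assumes NumPy and a hypothetical helper `make_windows`; the lookback of 3 steps and the toy series are illustrative choices, not values from the paper.

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Slice a 1-D series into (input window, target window) training pairs."""
    series = np.asarray(series, dtype=float)
    n = len(series) - lookback - horizon + 1
    if n <= 0:
        raise ValueError("series too short for this lookback and horizon")
    X = np.stack([series[i:i + lookback] for i in range(n)])
    Y = np.stack([series[i + lookback:i + lookback + horizon] for i in range(n)])
    return X, Y

temps = np.arange(10.0)  # stand-in for a daily temperature series

# Initial target domain: predict one day ahead.
X1, Y1 = make_windows(temps, lookback=3, horizon=1)
# Later target domain: the same inputs must now yield a three-day forecast.
X3, Y3 = make_windows(temps, lookback=3, horizon=3)

print(X1.shape, Y1.shape)
print(X3.shape, Y3.shape)
```

The inputs keep the same width while the target width grows, so a model trained for the first setting needs its output side adapted. Under data domain incremental learning, by contrast, both shapes would stay fixed and only the values' distribution would change.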

4.2.3. Task Management Approaches

The paper further distinguishes how tasks (or evolving data scenarios) are managed in continual learning for time series:

  • Task-based Approaches:
    • Principle: The dataset is arbitrarily separated into fixed-size subsets, each corresponding to a "new task." This is common for evaluation of CL systems.
    • Drawbacks: This arbitrary separation may not represent real-world transitions between tasks and primarily focuses on incremental learning through data progression, leaving catastrophic forgetting solely to the CL strategy employed. It often relies on external supervision (human or machine) to define task boundaries.
    • Cited Works: [24, 27, 29, 39–46]
  • Task-free Approaches:
    • Principle: The model continuously learns without explicit, predefined task boundaries. Task changes are detected automatically, often based on internal cues or data characteristics.
    • Advantages: More realistic for dynamic environments where explicit task boundaries are unknown or fluid. Better task management (e.g., through unsupervised detection of changes) can help in forgetting management and balancing plasticity/stability.
    • Cited Works: [35, 47] (e.g., [47] proposes task detection through loss analysis, [35] uses a novelty data buffer).
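
The two task management styles can be contrasted with a short sketch. It assumes NumPy, and both helpers are hypothetical: `split_into_tasks` mimics the arbitrary fixed-size split of task-based evaluation, while `detect_boundaries` is a simple z-score rule on the running loss. The latter is only a stand-in for the idea of loss-based task detection; it is not the method of [47], whose details are not given in this analysis.

```python
import numpy as np

def split_into_tasks(series, task_size):
    """Task-based: cut the stream into consecutive fixed-size chunks."""
    series = np.asarray(series)
    return [series[i:i + task_size] for i in range(0, len(series), task_size)]

def detect_boundaries(losses, window=20, threshold=5.0):
    """Task-free (assumed rule): flag steps where the loss jumps far above its recent level."""
    losses = np.asarray(losses, dtype=float)
    boundaries = []
    start = 0  # first step of the current regime
    for t in range(len(losses)):
        recent = losses[max(start, t - window):t]
        if len(recent) < window:
            continue  # not enough history in this regime yet
        mu, sigma = recent.mean(), recent.std() + 1e-8
        if (losses[t] - mu) / sigma > threshold:
            boundaries.append(t)
            start = t  # restart the statistics after the detected change
    return boundaries

# Simulated per-step forecasting loss: stable around 1.0, then jumps to 3.0 at step 100.
steps = np.arange(200)
losses = np.where(steps < 100, 1.0, 3.0) + 0.05 * np.sin(steps)

tasks = split_into_tasks(losses, task_size=50)
boundaries = detect_boundaries(losses)
print(len(tasks), boundaries)
```

The fixed split produces four "tasks" even though the stream contains only one real change, whereas the detector reports a single boundary at the step where the loss level shifts. This illustrates why arbitrary task splits may not reflect real transitions.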

4.2.4. Application Domains

The survey identifies that continual learning for forecasting is predominantly applied in the energy domain (e.g., load prediction, renewable energy forecasting) [24, 35, 36, 39–41, 46]. Other applications include industrial maintenance [36, 43], climatic analysis [36, 42, 45], and traffic analysis [27, 36]. The diversity of datasets across these applications currently hinders direct model comparisons.

5. Experimental Setup

As a survey paper, this work does not present its own experimental setup. Instead, it discusses the types of datasets, evaluation metrics, and baselines used in the continual learning for time series forecasting literature it reviews. The authors use results from existing papers to illustrate the effectiveness of continual learning approaches.

5.1. Datasets

The paper notes that most research on continual learning for regression tasks (time series forecasting) focuses on specific applications, often using distinct datasets.

  • Sources: The application domains mentioned include:
    • Energy Domain: Energy load prediction, wind farm energy production [24, 35, 36, 39–41, 46]. These datasets typically involve sequences of energy consumption or generation over time, often exhibiting daily and seasonal periodicity but also subject to climatic variations and data shifts.
    • Industrial Maintenance: Data related to equipment performance, sensor readings, or failure rates [36, 43].
    • Climatic Analysis: Meteorological data, greenhouse conditions [36, 42, 45].
    • Traffic Analysis: Traffic flow data [27, 36].
  • Characteristics: These datasets are primarily time series data, characterized by temporal dependencies, potential periodicity, and susceptibility to nonstationarity (e.g., data drift or concept drift).
  • Lack of Standardization: A significant observation by the authors is the lack of common benchmarks in the continual learning literature for time series prediction. This diversity of datasets makes direct comparison between different CL models and algorithms challenging.
  • Example of data behavior: While no concrete data samples are provided in the survey itself, the discussion around wind farm energy production (Experiment 1 in [35]) implies time series of power output, which is influenced by wind speed and other climatic factors, showcasing daily and seasonal periodicity with additional climatic variations causing data shifts.

5.2. Evaluation Metrics

The paper discusses common regression metrics and introduces a specific continual learning metric.

  • Mean Squared Error (MSE):

    1. Conceptual Definition: MSE is the average squared difference between the predicted values and the actual values. It is a common way to quantify the magnitude of error in regression tasks, and a lower MSE indicates a better fit of the model to the data. Because the errors are squared, large errors are penalized more heavily than small ones.
    2. Mathematical Formula: $ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 $
    3. Symbol Explanation:
      • $n$: The total number of data points (observations).
      • $Y_i$: The actual (observed) value for the $i$-th data point.
      • $\hat{Y}_i$: The predicted value for the $i$-th data point.
      • $\sum_{i=1}^{n}$: The sum over all $n$ data points.
  • R-squared ($R^2$):

    1. Conceptual Definition: $R^2$ (the coefficient of determination) is the proportion of the variance in the dependent variable that the regression model explains. Higher values indicate a better fit, with 1 meaning perfect prediction. For an ordinary least-squares fit with an intercept, evaluated on its own training data, it lies between 0 and 1; on held-out data it can be negative when the model predicts worse than the mean of the observations. (It appears in Figure 1 of the paper as the performance metric for forecasting.)
    2. Mathematical Formula: $ R^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} $
    3. Symbol Explanation:
      • $n$: The total number of data points (observations).
      • $Y_i$: The actual (observed) value for the $i$-th data point.
      • $\hat{Y}_i$: The predicted value for the $i$-th data point.
      • $\bar{Y}$: The mean of the actual (observed) values.
      • $\sum_{i=1}^{n}$: The sum over all $n$ data points.
  • Forgetting Ratio:

    1. Conceptual Definition: This metric specifically quantifies the degree to which a continual learning model forgets previously learned knowledge when learning new tasks. A lower forgetting ratio is desirable, indicating better retention of old information. It is crucial for evaluating continual learning strategies.
    2. Mathematical Formula: As provided in the paper (Equation (1)): $ \text{forgetting ratio} = \frac{\max(0,\ L_{warm\_up}^{2} - L_{warm\_up}^{1})}{L_{warm\_up}^{1}} $
    3. Symbol Explanation:
      • $L_{warm\_up}^{1}$: The Mean Squared Error (MSE) on a specific warm-up dataset (a dataset representing prior knowledge or tasks) at the end of the initial warm-up phase (initial training period).
      • $L_{warm\_up}^{2}$: The MSE on the same warm-up dataset after an update phase (learning new tasks/data).
      • $\max(0, \cdot)$: This function ensures that the forgetting ratio is non-negative. If the error on the old task decreases after learning new tasks (meaning the model actually improved on the old task), the forgetting ratio is set to 0, indicating no forgetting.
      • Purpose: This formula measures the increase in error on old tasks after learning new ones, normalized by the initial error on those old tasks. An increase in error means forgetting has occurred.
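The three metrics above can be sketched in a few lines of NumPy. This is an illustrative implementation written directly from the formulas in this section; the function names and the toy numbers are assumptions, not code from the survey or from [35].

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def forgetting_ratio(loss_warmup_before, loss_warmup_after):
    """Relative increase of the warm-up loss after an update, clipped at zero."""
    return max(0.0, loss_warmup_after - loss_warmup_before) / loss_warmup_before


# Toy values (assumed) to exercise the functions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 2.0]
print(mse(y_true, y_pred), r_squared(y_true, y_pred))
print(forgetting_ratio(0.02, 0.05), forgetting_ratio(0.02, 0.01))
```

Read this way, a forgetting ratio of 1.0 means the MSE on the warm-up data has doubled after the update, and a ratio of 0 means it stayed the same or improved.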

5.3. Baselines

The models against which continual learning approaches are typically compared in the surveyed literature include:

  • Fine-tuning:
    • Conceptual Definition: A standard approach where a pre-trained model (trained on older data) is further trained on new data without specific mechanisms to prevent catastrophic forgetting. The entire model's weights are updated based on the new data.
    • Why it's a baseline: It represents the performance of a naive adaptation strategy, often leading to significant forgetting of old tasks.
  • Joint Training (or Joint Learning):
    • Conceptual Definition: The model is trained on all available data (old and new) simultaneously, as if all data were available from the beginning.
    • Why it's a baseline: This represents an upper bound of performance for continual learning because it has access to a complete and representative dataset. Continual learning models aim to approach joint training performance without its computational and memory demands.
  • Standard Model (without CL strategies):
    • Conceptual Definition: A neural network model trained incrementally on new data without any continual learning mechanisms (i.e., no rehearsal, regularization, or architectural modification).
    • Why it's a baseline: This serves as a direct comparison point to demonstrate the efficacy of CL strategies in mitigating catastrophic forgetting.
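The difference between the fine-tuning and joint-training baselines can be illustrated with a deliberately small example. The sketch below is an assumption-laden toy: a two-parameter linear model, two synthetic "tasks" whose input-output relationship differs, and plain gradient descent. None of it comes from the surveyed papers.

```python
import numpy as np


def train(x, y, theta, lr=0.05, steps=2000):
    """Gradient descent on MSE for the linear model y = theta[0] * x + theta[1]."""
    theta = np.array(theta, dtype=float)  # copy, so the starting weights are not modified
    features = np.column_stack([x, np.ones_like(x)])
    for _ in range(steps):
        residual = features @ theta - y
        gradient = 2.0 * (features.T @ residual) / len(x)
        theta -= lr * gradient
    return theta


def mse(x, y, theta):
    return float(np.mean((theta[0] * x + theta[1] - y) ** 2))


# Two synthetic tasks (assumed): the relationship between x and y changes from A to B.
x_a = np.linspace(0.0, 1.0, 50)
y_a = 2.0 * x_a
x_b = np.linspace(2.0, 3.0, 50)
y_b = 6.0 - x_b

theta_a = train(x_a, y_a, [0.0, 0.0])            # initial training on task A
theta_finetuned = train(x_b, y_b, theta_a)       # fine-tuning: continue on task B only
theta_joint = train(np.concatenate([x_a, x_b]),  # joint training: all data at once
                    np.concatenate([y_a, y_b]), [0.0, 0.0])

print("Task A MSE after training on A:  ", mse(x_a, y_a, theta_a))
print("Task A MSE after fine-tuning on B:", mse(x_a, y_a, theta_finetuned))
print("Task A MSE after joint training:  ", mse(x_a, y_a, theta_joint))
```

Fine-tuning overwrites the task A solution, so its error on task A rises sharply, while joint training keeps the task A error far lower. Because this linear model is too small to fit both tasks exactly, joint training here is a compromise fit rather than a perfect one; with a sufficiently expressive model it plays the reference role described above.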

6. Results & Analysis

The survey compiles and analyzes findings from various research papers to demonstrate the effectiveness and challenges of continual learning in time series forecasting. The analysis relies primarily on MSE, $R^2$, and the forgetting ratio.

6.1. Core Results Analysis

The main experimental results presented across the surveyed literature strongly validate the effectiveness of continual learning approaches compared to standard fine-tuning for time series forecasting in nonstationary environments.

  • Superiority over Fine-tuning: Continual learning models consistently outperform fine-tuning by retaining knowledge from past tasks. Fine-tuning, without CL mechanisms, often exhibits a significant loss of precision on previously learned tasks. As shown in Figure 1, lifelong learning (a form of continual learning) maintains performance close to the upper bound set by joint training, whereas fine-tuning suffers from performance degradation on older tasks (e.g., task A, B) as new tasks (C) are learned.

    Figure 1. Comparison of fine-tuning, lifelong learning, and joint training for sequential learning of tasks A (blue), B (green), and C (red). Source: Figure 11 of [24].
    The chart plots the prediction performance metric $R^2$ (vertical axis) against the number of training epochs (horizontal axis) for the three methods on tasks A, B, and C.

  • Approaching Joint Training Performance: Continual learning models, particularly those employing rehearsal or effective regularization, can achieve performance levels close to joint training. Joint training represents an ideal scenario where all data is available simultaneously, thus CL aims to mimic this performance without its resource demands.

  • Lower Forgetting Rate: Studies explicitly measuring the forgetting rate confirm that continual learning models significantly reduce catastrophic forgetting compared to baseline models without CL strategies. Figure 3 illustrates this, showing that various CL algorithms (LWF, EWC, O-EWC, SI) maintain a much lower forgetting rate as the number of tasks increases, in contrast to a standard baseline. This directly addresses the plasticity/stability dilemma by enabling adaptation without severe knowledge loss.

    Figure 3. Performance of algorithms with increasing number of tasks by 2 in each step in the task and data domain incremental scenario. Source: Figures 3 and 4 of [41].
    The charts show how the forgetting rate (vertical axis) of Baseline, LWF, EWC, Online-EWC, and SI changes as the number of training tasks (horizontal axis) increases in the two incremental scenarios.

  • Improved Accuracy in Dynamic Environments: As illustrated in Figure 2, continual learning models (LWF, EWC, O-EWC, SI) generally achieve lower MSE values compared to a baseline model in data and target domain incremental scenarios. This suggests that CL is effective in adapting to evolving data distributions and changing prediction targets, even though it may come at the expense of increased training time.

    Figure 2. Comparison of training time and average MSE on test datasets over 20 experiments for algorithms in the data and target domain scenario. Source: Figures 1 and 2 of [41].
    The charts plot average MSE on the test datasets against training time (in seconds) for the two incremental scenarios, with Baseline, LWF, EWC, Online EWC, and SI distinguished by color and marker shape.

  • Specific Case of CLeaR Model: The CLeaR model [35] demonstrates the practical advantage of continual learning in wind farm forecasting. While it showed mixed results on purely periodic artificial data, its application to real-world wind farm data (which has periodicity but also climatic variations and data shifts) clearly showcased its superiority. Instance "C" (the continual learning model) achieved a significantly lower average MSE compared to standard model "A" and fine-tuned model "B", indicating better adaptation to real-world data shifts. This also comes with a lower forgetting rate for instance "C" (Table 2).
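Of the regularization methods compared in Figures 2 and 3, EWC is the simplest to write down: it adds a quadratic penalty that discourages moving weights that were important for earlier tasks. The sketch below shows only that penalty term in its standard form, with a diagonal Fisher information estimate passed in as an array. The function names and example values are hypothetical, and this is not the implementation evaluated in [41].

```python
import numpy as np


def ewc_penalty(theta, theta_old, fisher_diag, lam):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    theta = np.asarray(theta, float)
    theta_old = np.asarray(theta_old, float)
    fisher_diag = np.asarray(fisher_diag, float)
    return float(0.5 * lam * np.sum(fisher_diag * (theta - theta_old) ** 2))


def regularized_loss(new_task_loss, theta, theta_old, fisher_diag, lam=1.0):
    """Loss on the new data plus the penalty that anchors important old weights."""
    return new_task_loss + ewc_penalty(theta, theta_old, fisher_diag, lam)


# Assumed example: the first weight mattered a lot for the old task, the second barely.
theta_old = [2.0, 0.0]
fisher_diag = [10.0, 0.1]
cost_move_first = ewc_penalty([3.0, 0.0], theta_old, fisher_diag, lam=1.0)
cost_move_second = ewc_penalty([2.0, 1.0], theta_old, fisher_diag, lam=1.0)
print(cost_move_first, cost_move_second)
```

The coefficient `lam` is one concrete handle on the plasticity/stability trade-off: a larger value favors stability, a smaller one favors plasticity.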

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| Fitting Error MSE (e-2) | Instance A | Instance B | Instance C | Baseline |
| :--- | :--- | :--- | :--- | :--- |
| Mean | 5.138 | 5.442 | 2.829 | 3.190 |

This table compares the average Mean Squared Error (MSE), reported in units of $10^{-2}$, for three instances of the same model and a baseline across 10 wind farm forecasting datasets. Instance C, representing the CLeaR continual learning model, shows the lowest average MSE (2.829), indicating its superior performance in adapting to wind farm data compared to Instance A (standard model), Instance B (fine-tuned model), and the Baseline.

The following are the results from Table 2 of the original paper:

| Forgetting Ratio | Instance B (AE) | Instance C (AE) | Instance B | Instance C |
| :--- | :--- | :--- | :--- | :--- |
| Mean | 1.402 | 1.171 | 3.550 | 1.161 |

This table presents the average forgetting ratio over 10 wind farm forecasting datasets for CLeaR model instances. For the main predictor, Instance C (the continual learning model) has a substantially lower forgetting ratio (1.161) than Instance B (the fine-tuned model) (3.550). For the autoencoder (AE) component the gap is smaller (1.171 versus 1.402). In both cases the continual learning strategies in Instance C reduce forgetting.
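The size of these differences is easier to judge as relative reductions. The short calculation below uses only the mean values reported in Tables 1 and 2; the helper name `relative_reduction` is an assumption for illustration.

```python
def relative_reduction(reference, value):
    """Fraction by which `value` is lower than `reference`."""
    return (reference - value) / reference


# Mean values copied from Table 1 (MSE, in units of 1e-2) and Table 2 (forgetting ratio).
mse_e2 = {"A": 5.138, "B": 5.442, "C": 2.829, "Baseline": 3.190}
forgetting = {"B": 3.550, "C": 1.161, "B (AE)": 1.402, "C (AE)": 1.171}

mse_vs_a = relative_reduction(mse_e2["A"], mse_e2["C"])
mse_vs_b = relative_reduction(mse_e2["B"], mse_e2["C"])
mse_vs_baseline = relative_reduction(mse_e2["Baseline"], mse_e2["C"])
forgetting_vs_b = relative_reduction(forgetting["B"], forgetting["C"])
forgetting_ae = relative_reduction(forgetting["B (AE)"], forgetting["C (AE)"])

for name, value in [("MSE, C vs A", mse_vs_a), ("MSE, C vs B", mse_vs_b),
                    ("MSE, C vs Baseline", mse_vs_baseline),
                    ("Forgetting, C vs B", forgetting_vs_b),
                    ("Forgetting (AE), C vs B", forgetting_ae)]:
    print(f"{name}: {value:.1%}")
```

Instance C's mean MSE is roughly 45% lower than Instance A's and roughly 11% lower than the Baseline's, and its mean forgetting ratio is about two-thirds lower than Instance B's. Following the definition of the forgetting ratio, a mean of 1.161 still corresponds to the warm-up MSE more than doubling on average, so forgetting is reduced rather than eliminated.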

6.3. Ablation Studies / Parameter Analysis

The paper references a study [35] in which the parameterization of the CLeaR model was crucial. On highly periodic datasets, a standard model could outperform the continual learning model when the CL model's plasticity was set too high. The parameters that govern how well a CL model captures periodicity therefore matter. With suitable settings, a continual learning model can perform comparably to a standard model on purely periodic data, while its real advantage appears in complex, dynamic scenarios such as wind farm forecasting, where data shifts occur on top of the underlying periodicity. This amounts to an implicit parameter analysis in which the balance between plasticity and stability is the critical factor.

7. Conclusion & Reflections

7.1. Conclusion Summary

This survey paper provides a foundational overview of continual learning (CL) applied to time series forecasting, a field that is still in its nascent stages compared to CL for classification tasks. The paper highlights that deep learning models, while powerful, face significant challenges in nonstationary environments due to data and concept drift, leading to catastrophic forgetting. Continual learning emerges as a promising solution, offering adaptability and stability by allowing models to continuously learn new information without compromising previously acquired knowledge.

The core of CL lies in addressing the plasticity/stability dilemma, and existing strategies generally fall into three categories: data rehearsal, architectural modification, and regularization. While these strategies have been extensively explored in classification, their application to time series forecasting is less developed. The paper identifies specific CL scenarios for time series (data domain incremental and target domain incremental learning) and discusses the merits of task-based versus task-free approaches. Experimental evidence from various applications, particularly in the energy domain, demonstrates CL's superiority over fine-tuning and its ability to approach joint training performance, often with lower forgetting rates.

7.2. Limitations & Future Work

The authors point out several key limitations and suggest future research directions:

  • Lack of Common Benchmarks: A significant limitation is the absence of standardized datasets and evaluation protocols for continual learning in time series forecasting. This hinders direct comparison and robust evaluation of new CL models, unlike the established benchmarks in classification.
  • Task Definition and Management: The concept of "tasks" in time series is often ambiguous, unlike in classification. Most current approaches use task-based arbitrary divisions, which may not reflect real-world data drifts. There is a strong need for task-free solutions where task boundaries are detected automatically or implicitly.
  • Adaptation of Complex DL Models: Future work should focus on adapting more complex and popular deep learning architectures (like encoder-decoder systems, attention mechanisms, Transformers, Autoencoders) to continual learning for time series.
  • New Metrics Development: The paper calls for the development of new, relevant metrics beyond standard regression errors. These should include refined measures for forgetting rate, plasticity/stability (e.g., adaptation speed of the loss function), and crucially, resource consumption (CPU, GPU, memory, electricity).
  • Explainability: As continual learning models become more complex, ensuring their explainability and interpretability is a growing challenge and a vital area for future research, as it helps understand model operation and builds trust in predictions.
  • Broader Application Domains: While some applications exist (energy, economics), CL for time series has immense untapped potential in fields like medicine (patient data, physiological signals) and environmental studies (meteorology, climate, hydrology).
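As a rough illustration of the resource-consumption metrics called for above, the sketch below times a single model update and records its peak Python memory with the standard library. It is an assumption-laden starting point: `measure_update` and `dummy_update` are hypothetical names, and `tracemalloc` only tracks memory allocated through Python, so GPU memory and electricity use would need separate tooling.

```python
import time
import tracemalloc


def measure_update(update_fn, *args, **kwargs):
    """Run one model update and report wall-clock time and peak Python memory."""
    tracemalloc.start()
    start = time.perf_counter()
    result = update_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, {"seconds": elapsed, "peak_bytes": peak_bytes}


def dummy_update(n):
    # Stand-in for a training step (hypothetical): build a list and sum it.
    return sum([i * i for i in range(n)])


result, stats = measure_update(dummy_update, 10_000)
print(stats)
```

Logging these figures alongside MSE and the forgetting ratio for every update phase would allow the accuracy/cost trade-off shown in Figure 2 to be tracked over time.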

7.3. Personal Insights & Critique

This survey serves as an excellent starting point for anyone interested in the intersection of continual learning and time series forecasting. Its beginner-friendly explanations of core CL concepts and strategies are highly valuable. The clear distinction between CL scenarios for classification versus time series is particularly insightful, highlighting the unique challenges of regression and temporal data.

One of the paper's strengths is its emphasis on the need for standardized benchmarks. The absence of such benchmarks is indeed a critical barrier to progress in this nascent field, as it makes it difficult to objectively compare and validate new CL techniques. The call for task-free solutions is also pragmatic, reflecting the dynamic and often unstructured nature of real-world time series data.

A potential area for improvement, though acknowledged by the authors as a limitation of the current literature, is the lack of detailed algorithmic descriptions specific to time series forecasting. While the survey outlines the general CL strategies, a more in-depth discussion or examples of how these strategies are concretely implemented within time series models (e.g., how regularization is applied to LSTM weights or how replay buffers are managed for sequences) would be beneficial, even if it meant referencing specific complex papers.
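As one example of the kind of concrete detail meant here, a replay buffer for sequences can store (input window, target) pairs rather than individual points. The sketch below uses reservoir sampling so that every window seen so far has the same chance of being kept. The class and function names are hypothetical, and this is not taken from any of the surveyed papers.

```python
import random


class ReservoirReplayBuffer:
    """Fixed-size memory of (input_window, target) pairs filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, window, target):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append((window, target))
        else:
            # Each of the `seen` samples ends up stored with probability capacity / seen.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = (window, target)

    def sample(self, batch_size):
        """Draw old samples to mix into the next training batch."""
        return self.rng.sample(self.items, min(batch_size, len(self.items)))


def sliding_windows(series, input_len, horizon=1):
    """Yield (window, target) pairs for forecasting `horizon` steps ahead."""
    for start in range(len(series) - input_len - horizon + 1):
        window = series[start:start + input_len]
        target = series[start + input_len + horizon - 1]
        yield window, target


# Toy series (assumed): the integers 0..99, so each target is the window's last value plus 1.
series = list(range(100))
buffer = ReservoirReplayBuffer(capacity=10)
for window, target in sliding_windows(series, input_len=5):
    buffer.add(window, target)
replay_batch = buffer.sample(4)
print(replay_batch)
```

Storing whole windows keeps the temporal context each forecast depends on, at the cost of more memory per stored sample than in a classification setting.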

The paper's focus on resource consumption metrics is forward-thinking and crucial for the practical deployment of continual learning models, especially in edge computing or resource-constrained environments. The idea that CL could also be a source of explainability is an intriguing concept that warrants further exploration.

The methods and conclusions presented here can be transferred to other domains that deal with streaming data or evolving concepts. Examples include fraud detection, where fraud patterns change continuously, and robotics, where robots must adapt to new environments and tasks. The principles of balancing plasticity and stability and of mitigating catastrophic forgetting apply to any AI system designed for lifelong learning. Overall, this paper effectively frames the current state and future directions of an important and challenging area of AI.
