Continual Learning for Time Series Forecasting: A First Survey
TL;DR Summary
This survey reviews continual learning methods for time series forecasting, addressing data and concept drift, and balancing plasticity and stability to adapt models in dynamic, nonstationary environments.
Abstract
Quentin Besnard, Nicolas Ragot. Continual Learning for Time Series Forecasting: A First Survey. ITISE 2024, Jul 2024, Gran Canaria, Spain. Eng. Proc. 2024, 68, 49. https://doi.org/10.3390/engproc2024068049. HAL Id: hal-04836655 (https://hal.science/hal-04836655v1, deposited 13 Dec 2024). Distributed under a Creative Commons Attribution 4.0 International License.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Continual Learning for Time Series Forecasting: A First Survey". This indicates that the paper provides an initial overview and review of the application of continual learning techniques specifically to the domain of time series forecasting.
1.2. Authors
The authors of the paper are Quentin Besnard and Nicolas Ragot.
- Quentin Besnard is affiliated with Université de Tours, Laboratoire d'Informatique Fondamentale et Appliquée de Tours (LIFAT), 37200 Tours, France. His email is quentin.besnard@univ-tours.fr.
- Nicolas Ragot is also affiliated with Université de Tours, Laboratoire d'Informatique Fondamentale et Appliquée de Tours (LIFAT), 37200 Tours, France. His email is nicolas.ragot@univ-tours.fr.
1.3. Journal/Conference
The paper was published in Eng. Proc. 2024, 68, 49. It was presented at the 10th International Conference on Time Series and Forecasting (ITISE 2024), held in Gran Canaria, Spain, from 15-17 July 2024.
Eng. Proc. (Engineering Proceedings) is a proceedings journal that publishes papers from various engineering conferences. While not a top-tier journal, publishing in conference proceedings is a common way to disseminate new research, especially in rapidly evolving fields. The ITISE conference is specialized in time series and forecasting, making it a relevant venue for this survey.
1.4. Publication Year
The paper was published on July 17, 2024. The HAL record shows a deposit date of December 13, 2024; since HAL is an open-access archive, deposit typically follows publication, so the two dates are consistent.
1.5. Abstract
Deep learning has achieved significant progress across various fields like robotics and image processing. However, a major challenge for neural networks is their high demand for quantitative and stationary data for proper model computation. Many real-life applications, especially in dynamic environments, cannot meet these constraints due to data and concept drift, where data distribution or goals change over time. Continual learning (CL) aims to address these issues by developing evolving models that can adapt over time, navigating the plasticity/stability dilemma while considering resource limitations. While CL efforts are seen across deep learning applications, its application to time series, particularly in regression and forecasting, is still limited. This paper serves as a first survey of continual learning applied to time series forecasting.
1.6. Original Source Link
The original source link provided is /files/papers/690053ceed47de95d44a33ed/paper.pdf.
This appears to be a local or internal file path. The HAL ID link is https://hal.science/hal-04836655v1, and the official citation link is https://doi.org/10.3390/engproc2024068049. These indicate it is an officially published work, likely part of conference proceedings.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to address is the limitation of traditional deep learning models in dynamic, real-world environments where data patterns and distributions are constantly changing. Standard neural networks require large amounts of stationary data (data whose statistical properties do not change over time) for effective training. However, in many practical scenarios, data drift (changes in data distribution) and concept drift (changes in the underlying relationship between input and output variables) are common. This non-stationarity makes pre-trained, static models quickly obsolete or inaccurate, necessitating frequent and costly retraining from scratch, which is often infeasible due to computational and data storage constraints.
This problem is important because artificial intelligence (AI) applications are increasingly deployed in nonstationary environments, such as energy management, climate prediction, and traffic analysis. The ability of AI models to adapt continuously to these evolving conditions is crucial for their long-term effectiveness and reliability. Existing research in continual learning (CL) has primarily focused on classification tasks (e.g., image recognition) and hasn't adequately addressed the unique challenges of time series forecasting where the output is a continuous value (regression) and the temporal dependency is critical.
The paper's entry point is to provide the "first survey" on continual learning specifically for time series forecasting. By doing so, it aims to consolidate the scattered knowledge in this nascent field, highlight existing approaches, identify gaps, and propose future research directions, thereby catalyzing further development.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- First Survey: It provides the first comprehensive review of continual learning techniques applied to time series forecasting, compiling existing works and concepts in this specific domain.
- Conceptual Clarification: It clarifies key continual learning principles, including the plasticity/stability dilemma and catastrophic forgetting, and categorizes the main strategies to address them (rehearsal, regularization, architectural modification).
- Domain-Specific Analysis: It analyzes how continual learning scenarios (e.g., task incremental, domain incremental) from classification translate (or don't) to time series forecasting, introducing data domain incremental learning and target domain incremental learning.
- Identification of Trends and Challenges: It identifies current research trends, such as combining CL strategies and biologically inspired models, and highlights significant challenges, including the lack of common benchmarks, the need for task-free solutions, and the development of new metrics for forgetting rate, plasticity/stability, and resource consumption.
- Application Scope: It reviews the main application domains where CL for time series forecasting is being explored (primarily energy, industrial maintenance, climate, and traffic analysis) and emphasizes its potential impact on broader fields like medicine and environmental studies.

The key conclusions and findings include:

- Continual learning models demonstrate superior performance compared to fine-tuning and often approach the performance of joint training (an upper bound) in nonstationary environments while effectively mitigating catastrophic forgetting.
- Despite promising results, the field of CL for time series forecasting is nascent and lacks standardization, particularly in terms of common benchmarks and task definition.
- Existing CL strategies, largely adapted from classification, need further refinement and novel development to fully address the specific characteristics of time series data and regression tasks.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a foundational understanding of several key concepts is essential, particularly for a beginner.
- Deep Learning (DL): A subfield of machine learning that uses artificial neural networks with multiple layers (hence "deep") to learn complex patterns from data. Unlike traditional machine learning algorithms, deep learning models can automatically extract features from raw data, reducing the need for manual feature engineering. It has driven significant advancements in AI in areas like image recognition, natural language processing, and robotics.
- Neural Networks (NNs): Computational models inspired by the structure and function of biological neural networks. They consist of interconnected nodes (neurons) organized in layers. Each connection has a weight associated with it, and neurons apply an activation function to the weighted sum of their inputs to produce an output. Through a process called training, the weights are adjusted to minimize the difference between the network's predictions and the actual target values.
- Time Series: A sequence of data points indexed in time order. Examples include stock prices, weather measurements, energy consumption, and sensor readings. The key characteristic is the inherent temporal dependency, meaning that past observations can influence future ones.
- Time Series Forecasting: The process of predicting future values of a time series based on its past values and potentially other related factors. It is a type of regression task where the output is a continuous number. Traditional methods include ARIMA, Exponential Smoothing, and Prophet, while deep learning models like Recurrent Neural Networks (RNNs), LSTMs, GRUs, and Transformers are increasingly used (a minimal windowing sketch follows this list).
- Regression: A machine learning task where the goal is to predict a continuous numerical value, for example house prices, temperature, or energy demand. This contrasts with classification.
- Classification: A machine learning task where the goal is to predict a categorical label or class, for example classifying an image as a "cat" or "dog," or determining whether an email is "spam" or "not spam."
- Nonstationary Environments: Environments where the underlying statistical properties of the data change over time. This means that patterns learned from past data may not hold true for future data, making models trained on historical data less effective over time.
- Data Drift: A phenomenon where the statistical properties of the input data change over time. For example, in a sensor network, readings might gradually shift due to environmental factors or sensor degradation. The relationship between input and output may remain the same, but the input distribution changes.
- Concept Drift: A phenomenon where the underlying relationship between the input variables and the target variable changes over time. For example, consumer preferences for a product might shift, so that the features which predicted sales in the past no longer do so effectively. This is a more profound change than data drift.
- Continual Learning (CL) (also known as lifelong learning, incremental learning, or progressive learning): A subfield of machine learning focused on enabling models to continuously learn new tasks or new information from a stream of data without forgetting previously acquired knowledge. The goal is to build models that can adapt and evolve over their lifetime.
- Catastrophic Forgetting: A major challenge in continual learning where a neural network, upon learning a new task or new data, rapidly and severely forgets knowledge acquired from previous tasks or data. This happens because updating the network's weights for new information often overwrites the weights critical for old information.
- Plasticity/Stability Dilemma: A fundamental trade-off in continual learning. Plasticity refers to a model's ability to quickly adapt and learn new information. Stability refers to its ability to retain previously learned knowledge. A highly plastic model may be prone to catastrophic forgetting, while a very stable model may struggle to learn new tasks. Continual learning aims to find a balance between these two conflicting objectives.
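To make the regression framing concrete, the following is a minimal sketch of converting a univariate time series into sliding-window inputs and forecast targets. The function name, the 24-step window, and the synthetic signal are illustrative assumptions, not taken from the survey.

```python
# Minimal sketch: framing univariate time series forecasting as regression
# with a sliding window (window length and names are illustrative).
import numpy as np

def make_windows(series: np.ndarray, input_len: int = 24, horizon: int = 1):
    """Split a 1-D series into (past window, future target) pairs."""
    X, y = [], []
    for t in range(len(series) - input_len - horizon + 1):
        X.append(series[t : t + input_len])            # past observations
        y.append(series[t + input_len + horizon - 1])  # value to forecast
    return np.array(X), np.array(y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic daily-periodic signal with noise, standing in for e.g. energy load.
    t = np.arange(500)
    series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)
    X, y = make_windows(series, input_len=24, horizon=1)
    print(X.shape, y.shape)  # (476, 24) (476,)
```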
3.2. Previous Works
The paper frames its discussion around existing continual learning strategies designed to combat catastrophic forgetting, primarily developed for classification tasks, and then explores their application to time series forecasting.
The three main categories of strategies for continual learning are:
- Rehearsal (or Data Replay) Strategies:
  - Conceptual Definition: These methods aim to prevent catastrophic forgetting by periodically re-exposing the model to a small subset of previously seen data (or synthetic data resembling it) while it learns new information. The idea is that by rehearsing past knowledge, the model reinforces the weights associated with it and prevents them from being completely overwritten by new learning.
  - Mechanism: During the training process for a new task or data batch, a small buffer of old data samples (or distilled knowledge from them) is mixed with the new data, and the model is trained on this combined set.
  - Example (Conceptual): Imagine a student learning topic A, then topic B. To prevent forgetting A, the student periodically reviews notes from topic A while studying topic B.
  - Challenges: The main challenge is managing the rehearsal buffer: deciding which past samples to store, how many, and how often to replay them. A larger buffer leads to better retention but higher memory and computational costs. Rehearsal-free versions try to generate synthetic data or use other means to simulate replay without storing actual past samples.
  - Cited Works: [13, 25, 27, 29]
- Regularization-based Strategies:
  - Conceptual Definition: These methods add a penalty term to the model's loss function during training. This penalty discourages significant changes to weights that are deemed important for previously learned tasks. The goal is to make the learning process for new tasks path-constrained, preserving the knowledge acquired from old tasks.
  - Mechanism: When learning a new task, the loss function is modified to include a term that measures how much the current weight updates deviate from the weights that were optimal for previous tasks. Weights that are more "important" for past knowledge are penalized more heavily if they change significantly.
  - Example: Elastic Weight Consolidation (EWC) [12, 33, 34]:
    - Conceptual Definition: EWC is a prominent regularization technique. It identifies which weights in the neural network are most crucial for previously learned tasks and then selectively slows down their learning rate when a new task is encountered. This is analogous to a human memory system where certain memories are "hardened" and resistant to change.
    - Mathematical Formula: The loss function for learning a new task B (after having learned task A) is modified as:
      $ L(W) = L_B(W) + \sum_i \frac{\lambda}{2} F_i (\theta_i - \theta_{A,i}^*)^2 $
    - Symbol Explanation:
      - $L(W)$: the total loss function for the current training step, considering both the new task and preventing forgetting.
      - $L_B(W)$: the standard loss function for the new task B (e.g., cross-entropy for classification, MSE for regression) using the current network weights.
      - $\sum_i$: summation over all individual weights in the network.
      - $\lambda$: a hyperparameter that controls the strength of the regularization term; a higher $\lambda$ means more emphasis on protecting old knowledge.
      - $F_i$: the Fisher Information Matrix diagonal entry for weight $i$. The Fisher Information Matrix estimates the importance of each weight for the previous task A; a higher $F_i$ means weight $i$ is more critical for task A.
      - $\theta_i$: the current value of weight $i$.
      - $\theta_{A,i}^*$: the value of weight $i$ after learning task A (i.e., the optimal weight for task A).
    - Purpose: The regularization term penalizes the squared difference between the current weight and its optimal value for task A. This penalty is scaled by $\lambda$ and $F_i$, effectively creating "elastic" constraints around important weights (a code sketch of this penalty follows this list).
  - Challenges: Calculating and storing the Fisher Information Matrix can be computationally expensive and memory-intensive for large networks and many tasks. Also, if tasks are too disjoint, regularization might lead to an "average" model that performs poorly on all tasks.
  - Cited Works: [12, 33, 34] (for EWC), [15] (for Synaptic Intelligence, another regularization method).
- Architectural Modification (or Model Extension) Strategies:
  - Conceptual Definition: These methods address catastrophic forgetting by dynamically altering the network's architecture (e.g., adding or removing neurons/layers) as new tasks are learned. Instead of modifying existing weights for old tasks, new parts of the network are dedicated to new tasks, preserving the old parts.
  - Mechanism: When a new task arrives, the model's capacity is increased by adding new neurons, layers, or even entire subnetworks. The parts of the network responsible for previous tasks are "frozen" or isolated from modification.
  - Example (Conceptual): Imagine building new rooms onto a house for new purposes, rather than constantly re-purposing existing rooms and potentially damaging their original function.
  - Challenges: This approach leads to a continuously growing model size, which can become computationally and memory-prohibitive. Techniques like pruning (removing unnecessary connections or neurons) can help mitigate this but are complex to implement without performance degradation.
  - Cited Works: [30, 31, 32]

These three main strategies can also be combined into hybrid versions to leverage their respective strengths, offering trade-offs between performance, memory, and computational cost [1, 17, 19, 35–37].
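As an illustration of the EWC-style penalty above, here is a hedged PyTorch sketch assuming a generic regression model. The Fisher diagonal is approximated from squared gradients of the old-task loss, and the function names (`fisher_diagonal`, `ewc_penalty`) are my own, not taken from the cited works.

```python
# Sketch of an EWC-style penalty (approximation, not the exact published recipe).
import torch
import torch.nn as nn

def fisher_diagonal(model: nn.Module, old_inputs, old_targets, loss_fn):
    """Approximate diag(F) by the squared gradients of the old-task loss."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.zero_grad()
    loss = loss_fn(model(old_inputs), old_targets)
    loss.backward()
    for n, p in model.named_parameters():
        if p.grad is not None:
            fisher[n] = fisher[n] + p.grad.detach() ** 2
    return fisher

def ewc_penalty(model: nn.Module, fisher, old_params, lam: float = 100.0):
    """lambda/2 * sum_i F_i * (theta_i - theta*_{A,i})^2, as in the EWC loss."""
    penalty = torch.zeros(())
    for n, p in model.named_parameters():
        penalty = penalty + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * penalty

# Usage sketch (after training on the old task A):
#   old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
#   fisher     = fisher_diagonal(model, x_old, y_old, nn.MSELoss())
# then, for each new-task batch:
#   loss = nn.MSELoss()(model(x_new), y_new) + ewc_penalty(model, fisher, old_params)
```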
3.3. Technological Evolution
The evolution of deep learning and continual learning has followed a path of increasing complexity and specialization:
- Early Deep Learning: Focused on static datasets, achieving breakthroughs in image recognition (CNNs) and natural language processing (RNNs, LSTMs). These models excelled when the training and test data shared the same distribution.
- Emergence of Continual Learning: As deep learning models moved into real-world applications, the problem of catastrophic forgetting in dynamic environments became apparent. This spurred research into continual learning, initially heavily focused on classification tasks, driven by standardized benchmarks like MNIST and CIFAR.
- Development of CL Strategies: The three main categories (rehearsal, regularization, architectural) emerged as solutions, each with its own advantages and disadvantages, constantly being refined and combined.
- Application to Time Series: More recently, with the increasing use of deep learning for time series forecasting, the challenges of nonstationary environments and concept drift have highlighted the need for continual learning in this domain. This paper sits at this juncture, surveying these early exploratory efforts.

This work fits within the timeline as a consolidator of early research efforts, acting as a foundational review for a relatively new subfield.
3.4. Differentiation Analysis
Compared to the main methods in related work, this paper differentiates itself not by proposing a new continual learning algorithm or model, but by being the first dedicated survey focusing specifically on the intersection of continual learning and time series forecasting.
- Focus on Time Series Forecasting: While continual learning research is abundant, the vast majority of it targets classification tasks (e.g., image classification). This paper explicitly shifts the focus to regression tasks on time series data, highlighting the unique challenges and opportunities in this domain (e.g., specific types of drift, lack of task boundaries, temporal dependencies).
- Survey Nature: Instead of introducing a novel algorithm, the paper's innovation lies in its comprehensive review, categorization, and critical analysis of existing CL techniques and their nascent applications in time series forecasting. It acts as a guide, consolidating scattered knowledge, identifying key concepts, outlining current trends, and pinpointing significant research gaps (like the absence of common benchmarks).
- Identification of Specific Scenarios: It introduces and clarifies data domain incremental learning and target domain incremental learning as the primary continual learning scenarios relevant to time series forecasting, distinguishing them from the task incremental learning often seen in classification.
- Highlighting Gaps: A crucial differentiation is its emphasis on the lack of task-free solutions and of common benchmarks for time series, both of which are better established in classification CL. This sets the agenda for future research, urging the community to develop solutions tailored to time series.
4. Methodology
As a survey paper, this work does not propose a novel algorithm or model in the traditional sense. Instead, its methodology revolves around structuring and analyzing the existing literature on continual learning within the context of time series forecasting. The authors' approach involves:
- Defining Core Principles: Establishing a common understanding of continual learning concepts.
- Categorizing Strategies: Grouping existing continual learning methods based on their underlying mechanisms.
- Analyzing Domain-Specific Challenges: Adapting continual learning scenarios from classification to time series and identifying unique problems.
- Reviewing Applications and Results: Summarizing findings from applied works in time series forecasting.
- Identifying Trends and Future Work: Outlining open problems and promising research directions.
4.1. Principles
The core idea guiding the survey is the fundamental challenge of catastrophic forgetting and the plasticity/stability dilemma in deep learning models operating in nonstationary environments. The theoretical basis is that standard models, trained on fixed datasets, cannot adapt to evolving data distributions or goals without losing previously acquired knowledge. Continual learning aims to solve this by enabling models to continuously learn and adapt over time.
- Stability: A model exhibits stability if it can accurately predict outcomes when the data distribution, patterns, or environment remain constant. This is ideal for fixed, perfectly sampled datasets but fails when changes occur.
- Plasticity: A model exhibits plasticity if it can continuously and quickly adapt to new tasks and data. While crucial for adaptation, excessive plasticity can lead to catastrophic forgetting as new learning overwrites old knowledge.
- Compromise: Continual learning seeks a compromise between plasticity and stability, allowing the model to adapt to new information without forgetting old information, ideally within given material and computational constraints.
4.2. Core Methodology In-depth (Survey Structure and Analysis Framework)
The survey's methodological framework involves analyzing existing continual learning literature through the lens of its three main strategies, then examining how these apply to time series forecasting and the specific challenges encountered.
4.2.1. Strategies to Address Catastrophic Forgetting
The paper outlines three primary strategies to address catastrophic forgetting, which form the basis for classifying continual learning approaches:
- Data Rehearsal (Replay):
  - Principle: Constantly reintroducing samples of past data into the learning steps to adapt the model. This stimulates the model with past knowledge to prevent its degradation.
  - Mechanism: A memory buffer stores a selection of past data. When learning new data, the model is trained on a mix of new data and samples from this memory buffer (see the replay-buffer sketch after this list).
  - Advantages: Often yields the best results in terms of model accuracy for retaining old knowledge.
  - Drawbacks: Leads to a perpetual increase in the amount of data the model must manage over time, increasing memory and computational costs. Rehearsal-free versions are being explored to mitigate this.
  - Research Directions: Management of data to be reintroduced (e.g., selection criteria for the buffer).
- Architectural Modification (Model Extension):
  - Principle: Adapting the model's architecture directly as new tasks or data are encountered. This avoids modifying parts of the network that are used for previous tasks.
  - Mechanism: The size of network layers or the number of components grows over time. New tasks are learned by adding new neurons, layers, or subnetworks, while existing weights corresponding to past tasks are often frozen.
  - Advantages: No loss in performance on previously learned tasks because their weights are preserved.
  - Drawbacks: Continuous growth in model size, leading to increased memory footprint and computational requirements. Pruning mechanisms can address this but are not well established for continual learning.
- Regularization Approach:
  - Principle: Penalizing the modification of neural network weights based on their estimated importance for previously learned tasks. The goal is to constrain weight updates so as to retain prior knowledge.
  - Mechanism: An additional term is added to the loss function during training. This term quantifies the importance of each weight for past tasks (e.g., using Fisher Information) and penalizes changes to important weights.
  - Example (EWC, as explained in Foundational Concepts):
    $ L(W) = L_{\mathrm{new\_task}}(W) + \sum_i \frac{\lambda}{2} F_i (\theta_i - \theta_{\mathrm{old\_task},i}^*)^2 $
    where $L_{\mathrm{new\_task}}(W)$ is the loss for the new task, $\lambda$ is the regularization strength, $F_i$ is the Fisher information for weight $i$, $\theta_i$ is the current weight value, and $\theta_{\mathrm{old\_task},i}^*$ is the optimal weight value for previous tasks.
  - Advantages: Does not imply network growth or data replay, making it potentially more scalable in the long run.
  - Drawbacks: Can still exhibit some forgetting if tasks are too disjoint or the regularization is not perfectly tuned. It might lead to a model that is an "average" performer across many diverse tasks.
- Hybrid Versions: The paper notes that these three strategies can be combined (e.g., regularization with data replay) to achieve better performance by leveraging their respective strengths.
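The rehearsal strategy can be illustrated with a small replay-buffer sketch. Reservoir sampling, the buffer capacity, and the class name are illustrative choices, not a method prescribed by the surveyed papers.

```python
# Sketch of a reservoir-style rehearsal buffer for streaming time series batches.
import random
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity: int = 1000, seed: int = 0):
        self.capacity = capacity
        self.samples = []      # stored (window, target) pairs from past data
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, x: np.ndarray, y: float):
        """Reservoir sampling: keeps an approximately uniform sample of the stream."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append((x, y))
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = (x, y)

    def sample(self, k: int):
        """Draw k old examples to mix into the current training batch."""
        k = min(k, len(self.samples))
        return self.rng.sample(self.samples, k)
```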
4.2.2. Continual Learning Scenarios for Time Series Forecasting
The paper highlights that continual learning research in classification typically focuses on three scenarios:
- Task Incremental Learning: Sequentially learning to solve distinct tasks, where a task is defined by a goal within a context.
- Class Incremental Learning: Discriminating between incrementally observed classes.
- Domain Incremental Learning: Learning to solve the same problem in different contexts (e.g., recognizing the same objects in different lighting conditions).

For time series forecasting, the paper points out that the task incremental learning scenario (as defined for classification) is generally not addressed. Instead, two main scenarios are identified, as outlined in [38]:
- Incremental Learning of the Data Domain:
  - Principle: The underlying data generation process changes over time due to the nonstationarity of the data stream. The distribution of the data relative to the same objective varies.
  - Example: Predicting energy consumption, where consumption patterns shift due to seasonal changes or new appliance usage over time, but the goal (predicting consumption) remains the same. This is equivalent to data drift or concept drift where the prediction target remains constant.
- Incremental Learning of the Target Domain:
  - Principle: The output of the model varies over time, meaning the number or properties of prediction targets change.
  - Example: A model initially predicting single-day-ahead temperature might later need to predict multi-day-ahead temperatures, or new variables might need to be predicted (e.g., adding humidity alongside temperature). This involves changes in the prediction horizon or the output space (a construction sketch of both scenarios follows this list).
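A minimal sketch of how the two scenarios might be constructed from a synthetic stream; the level shift, the horizons, and the variable names are assumptions used only for illustration.

```python
# Illustrative construction of the two forecasting scenarios on a synthetic stream.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(2000)

# Data domain incremental: same goal (1-step-ahead forecast), but the data
# distribution shifts between periods (here, a level shift after t = 1000).
series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)
series[1000:] += 2.0                      # injected distribution (data) shift
periods = [series[:1000], series[1000:]]  # learned sequentially, one after the other

# Target domain incremental: same data stream, but the output space changes
# over time, e.g. the forecast horizon grows from 1 step to 24 steps ahead.
horizons = [1, 24]

print(len(periods), [p.shape for p in periods], horizons)
```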
4.2.3. Task Management Approaches
The paper further distinguishes how tasks (or evolving data scenarios) are managed in continual learning for time series:
- Task-based Approaches:
  - Principle: The dataset is arbitrarily separated into fixed-size subsets, each corresponding to a "new task." This is common for the evaluation of CL systems.
  - Drawbacks: This arbitrary separation may not represent real-world transitions between tasks and primarily focuses on incremental learning through data progression, leaving catastrophic forgetting solely to the CL strategy employed. It often relies on external supervision (human or machine) to define task boundaries.
  - Cited Works: [24, 27, 29, 39–46]
- Task-free Approaches:
  - Principle: The model continuously learns without explicit, predefined task boundaries. Task changes are detected automatically, often based on internal cues or data characteristics (a loss-monitoring sketch follows this list).
  - Advantages: More realistic for dynamic environments where explicit task boundaries are unknown or fluid. Better task management (e.g., through unsupervised detection of changes) can help in forgetting management and in balancing plasticity/stability.
  - Cited Works: [35, 47] (e.g., [47] proposes task detection through loss analysis, and [35] uses a novelty data buffer).
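To illustrate the task-free idea, here is a hedged sketch of change detection by monitoring the running prediction loss. The threshold rule and window length are illustrative and are not the specific methods of [47] or [35].

```python
# Sketch: task-free change detection by monitoring the prediction loss.
import numpy as np

def detect_changes(losses: np.ndarray, window: int = 50, k: float = 3.0):
    """Flag steps where the loss jumps k standard deviations above its recent mean."""
    change_points = []
    for t in range(window, len(losses)):
        recent = losses[t - window : t]
        if losses[t] > recent.mean() + k * recent.std():
            change_points.append(t)
    return change_points

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    losses = rng.normal(1.0, 0.1, 300)
    losses[200:] += 1.5  # simulated drift: the loss rises after step 200
    print(detect_changes(losses)[:3])  # first detections around step 200
```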
4.2.4. Application Domains
The survey identifies that continual learning for forecasting is predominantly applied in the energy domain (e.g., load prediction, renewable energy forecasting) [24, 35, 36, 39–41, 46]. Other applications include industrial maintenance [36, 43], climatic analysis [36, 42, 45], and traffic analysis [27, 36]. The diversity of datasets across these applications currently hinders direct model comparisons.
5. Experimental Setup
As a survey paper, this work does not present its own experimental setup. Instead, it discusses the types of datasets, evaluation metrics, and baselines used in the continual learning for time series forecasting literature it reviews. The authors use results from existing papers to illustrate the effectiveness of continual learning approaches.
5.1. Datasets
The paper notes that most research on continual learning for regression tasks (time series forecasting) focuses on specific applications, often using distinct datasets.
- Sources: The application domains mentioned include:
  - Energy Domain: Energy load prediction, wind farm energy production [24, 35, 36, 39–41, 46]. These datasets typically involve sequences of energy consumption or generation over time, often exhibiting daily and seasonal periodicity but also subject to climatic variations and data shifts.
  - Industrial Maintenance: Data related to equipment performance, sensor readings, or failure rates [36, 43].
  - Climatic Analysis: Meteorological data, greenhouse conditions [36, 42, 45].
  - Traffic Analysis: Traffic flow data [27, 36].
- Characteristics: These datasets are primarily time series data, characterized by temporal dependencies, potential periodicity, and susceptibility to nonstationarity (e.g., data drift or concept drift).
- Lack of Standardization: A significant observation by the authors is the lack of common benchmarks in the continual learning literature for time series prediction. This diversity of datasets makes direct comparison between different CL models and algorithms challenging.
- Example of data behavior: While no concrete data samples are provided in the survey itself, the discussion around wind farm energy production (Experiment 1 in [35]) implies time series of power output, influenced by wind speed and other climatic factors, showcasing daily and seasonal periodicity with additional climatic variations causing data shifts.
5.2. Evaluation Metrics
The paper discusses common regression metrics and introduces a specific continual learning metric.
- Mean Squared Error (MSE):
  - Conceptual Definition: MSE measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values. It is a common metric used to quantify the magnitude of error in regression tasks. A lower MSE indicates a better fit of the model to the data. It penalizes large errors more heavily than small errors due to the squaring operation.
  - Mathematical Formula:
    $ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 $
  - Symbol Explanation:
    - $n$: the total number of data points (observations).
    - $Y_i$: the actual (observed) value for the $i$-th data point.
    - $\hat{Y}_i$: the predicted value for the $i$-th data point.
    - $\sum_{i=1}^{n}$: the sum over all data points.
- R-squared ($R^2$):
  - Conceptual Definition: $R^2$ (coefficient of determination) is a statistical measure that represents the proportion of the variance of the dependent variable that is explained by the independent variable(s) in a regression model. It indicates how well the model predicts the outcome. Values range from 0 to 1, with higher values indicating a better fit. (It is mentioned in Figure 1 of the paper, implicitly as a performance metric for forecasting.)
  - Mathematical Formula:
    $ R^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} $
  - Symbol Explanation:
    - $n$: the total number of data points (observations).
    - $Y_i$: the actual (observed) value for the $i$-th data point.
    - $\hat{Y}_i$: the predicted value for the $i$-th data point.
    - $\bar{Y}$: the mean of the actual (observed) values.
    - $\sum_{i=1}^{n}$: the sum over all data points.
- Forgetting Ratio:
  - Conceptual Definition: This metric specifically quantifies the degree to which a continual learning model forgets previously learned knowledge when learning new tasks. A lower forgetting ratio is desirable, indicating better retention of old information. It is crucial for evaluating continual learning strategies.
  - Mathematical Formula: As provided in the paper (Equation (1)):
    $ \mathrm{forgetting\ ratio} = \frac{\max(0,\, L_{\mathrm{warm\_up}}^{2} - L_{\mathrm{warm\_up}}^{1})}{L_{\mathrm{warm\_up}}^{1}} $
  - Symbol Explanation:
    - $L_{\mathrm{warm\_up}}^{1}$: the Mean Squared Error (MSE) on a specific warm-up dataset (a dataset representing prior knowledge or tasks) at the end of the initial warm-up phase (initial training period).
    - $L_{\mathrm{warm\_up}}^{2}$: the MSE on the same warm-up dataset after an update phase (learning new tasks/data).
    - $\max(0, \cdot)$: ensures that the forgetting ratio is non-negative. If the error on the old task decreases after learning new tasks (meaning the model actually improved on the old task), the forgetting ratio is set to 0, indicating no forgetting.
  - Purpose: This formula measures the increase in error on old tasks after learning new ones, normalized by the initial error on those old tasks. An increase in error means forgetting has occurred (a computation sketch of these metrics follows this list).
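For concreteness, a small computation sketch of the three metrics above; the function and variable names are illustrative.

```python
# Sketch: MSE, R^2, and the forgetting ratio of Equation (1).
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def forgetting_ratio(mse_warmup_before: float, mse_warmup_after: float) -> float:
    """max(0, L^2_warm_up - L^1_warm_up) / L^1_warm_up, per Equation (1)."""
    return max(0.0, mse_warmup_after - mse_warmup_before) / mse_warmup_before

if __name__ == "__main__":
    y = np.array([1.0, 2.0, 3.0, 4.0])
    yhat = np.array([1.1, 1.9, 3.2, 3.8])
    print(mse(y, yhat), r_squared(y, yhat))
    # Error on the warm-up set rose from 0.10 to 0.25 after the update phase:
    print(forgetting_ratio(0.10, 0.25))  # ~1.5, i.e. a 150% relative increase
```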
5.3. Baselines
The models against which continual learning approaches are typically compared in the surveyed literature include:
- Fine-tuning:
  - Conceptual Definition: A standard approach where a pre-trained model (trained on older data) is further trained on new data without specific mechanisms to prevent catastrophic forgetting. All of the model's weights are updated based on the new data.
  - Why it's a baseline: It represents the performance of a naive adaptation strategy, often leading to significant forgetting of old tasks.
- Joint Training (or Joint Learning):
  - Conceptual Definition: The model is trained on all available data (old and new) simultaneously, as if all data were available from the beginning.
  - Why it's a baseline: This represents an upper bound of performance for continual learning because it has access to a complete and representative dataset. Continual learning models aim to approach joint training performance without its computational and memory demands.
- Standard Model (without CL strategies):
  - Conceptual Definition: A neural network model trained incrementally on new data without any continual learning mechanisms (i.e., no rehearsal, regularization, or architectural modification).
  - Why it's a baseline: This serves as a direct comparison point to demonstrate the efficacy of CL strategies in mitigating catastrophic forgetting (protocol skeletons for these regimes follow this list).
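The comparison regimes can be summarized as protocol skeletons; `train` and the `strategy` interface below are hypothetical placeholders used only to contrast the training loops, not an API from the surveyed papers.

```python
# Sketch of the comparison protocols: fine-tuning, joint training, and CL.
from typing import Callable, Sequence

def fine_tuning(model, tasks: Sequence, train: Callable):
    # Sequentially update on each new task, with no forgetting mitigation.
    for data in tasks:
        train(model, data)
    return model

def joint_training(model, tasks: Sequence, train: Callable):
    # Upper bound: train once on the union of all tasks' data.
    all_data = [sample for data in tasks for sample in data]
    train(model, all_data)
    return model

def continual_learning(model, tasks: Sequence, train: Callable, strategy):
    # Sequential updates, with each step mediated by a CL strategy
    # (rehearsal, regularization, or architectural growth) - placeholder hooks.
    for data in tasks:
        strategy.before_task(model, data)      # e.g. compute Fisher, fill buffer
        train(model, strategy.augment(data))   # e.g. mix in replayed samples
        strategy.after_task(model, data)       # e.g. store optimal weights
    return model
```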
6. Results & Analysis
The survey compiles and analyzes findings from various research papers to demonstrate the effectiveness and challenges of continual learning in time series forecasting. The analysis primarily relies on performance metrics like MSE and R-squared, and forgetting rate.
6.1. Core Results Analysis
The main experimental results presented across the surveyed literature strongly validate the effectiveness of continual learning approaches compared to standard fine-tuning for time series forecasting in nonstationary environments.
- Superiority over Fine-tuning: Continual learning models consistently outperform fine-tuning by retaining knowledge from past tasks. Fine-tuning, without CL mechanisms, often exhibits a significant loss of precision on previously learned tasks. As shown in Figure 1, lifelong learning (a form of continual learning) maintains performance close to the upper bound set by joint training, whereas fine-tuning suffers from performance degradation on older tasks (A, B) as a new task (C) is learned.

  ![Figure 1. Comparison of fine-tuning, lifelong learning, and joint training for sequential task learning A (blue), B (green), C (red)—Figure 11—[24].](/files/papers/690053ceed47de95d44a33ed/images/1.jpg)

  Figure 1. Comparison of fine-tuning, lifelong learning, and joint training for sequential task learning A (blue), B (green), C (red), with prediction performance on the vertical axis and training epochs on the horizontal axis—Figure 11—[24].

- Approaching Joint Training Performance: Continual learning models, particularly those employing rehearsal or effective regularization, can achieve performance levels close to joint training. Joint training represents an ideal scenario where all data is available simultaneously, so CL aims to mimic this performance without its resource demands.

- Lower Forgetting Rate: Studies explicitly measuring the forgetting rate confirm that continual learning models significantly reduce catastrophic forgetting compared to baseline models without CL strategies. Figure 3 illustrates this, showing that various CL algorithms (LWF, EWC, O-EWC, SI) maintain a much lower forgetting rate as the number of tasks increases, in contrast to a standard baseline. This directly addresses the plasticity/stability dilemma by enabling adaptation without severe knowledge loss.

  ![Figure 3. Performance of algorithms with increasing number of tasks by 2 in each step in the task and data domain incremental scenario—Figures 3 and 4—[41].](/files/papers/690053ceed47de95d44a33ed/images/3.jpg)

  Figure 3. Forgetting rate (vertical axis) versus number of training tasks (horizontal axis) for Baseline, LWF, EWC, Online-EWC, and SI, with tasks increasing by 2 in each step, in the task and data domain incremental scenarios—Figures 3 and 4—[41].

- Improved Accuracy in Dynamic Environments: As illustrated in Figure 2, continual learning models (LWF, EWC, O-EWC, SI) generally achieve lower MSE values compared to a baseline model in data and target domain incremental scenarios. This suggests that CL is effective in adapting to evolving data distributions and changing prediction targets, even though it may come at the expense of increased training time.

  ![Figure 2. Comparison of training time and average MSE on test datasets over 20 experiments for algorithms in the data and target domain scenario—Figures 1 and 2—[41].](/files/papers/690053ceed47de95d44a33ed/images/2.jpg)

  Figure 2. Average MSE on test datasets versus training time (seconds) over 20 experiments for Baseline, LWF, EWC, Online EWC, and SI in the data domain and target domain incremental scenarios—Figures 1 and 2—[41].

- Specific Case of the CLeaR Model: The CLeaR model [35] demonstrates the practical advantage of continual learning in wind farm forecasting. While it showed mixed results on purely periodic artificial data, its application to real-world wind farm data (which has periodicity but also climatic variations and data shifts) clearly showcased its superiority. Instance "C" (the continual learning model) achieved a significantly lower average MSE compared to standard model "A" and fine-tuned model "B", indicating better adaptation to real-world data shifts. This also comes with a lower forgetting rate for instance "C" (Table 2).
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| | Instance A | Instance B | Instance C | Baseline |
| :--- | :--- | :--- | :--- | :--- |
| Mean fitting error MSE (e-2) | 5.138 | 5.442 | 2.829 | 3.190 |

This table compares the average Mean Squared Error (MSE) (scaled by $10^{-2}$) for three instances of the same model and a baseline across 10 wind farm forecasting datasets. Instance C, representing the CLeaR continual learning model, shows the lowest average MSE (2.829), indicating its superior performance in adapting to wind farm data compared to Instance A (standard model), Instance B (fine-tuned model), and the Baseline.
The following are the results from Table 2 of the original paper:
| | Instance B (AE) | Instance C (AE) | Instance B | Instance C |
| :--- | :--- | :--- | :--- | :--- |
| Mean forgetting ratio | 1.402 | 1.171 | 3.550 | 1.161 |
This table presents the average forgetting ratio over 10 wind farm forecasting datasets for CLeaR model instances. Instance C (the continual learning model) demonstrates a significantly lower forgetting ratio (1.161) compared to Instance B (the fine-tuned model) (3.550). This result highlights the effectiveness of the continual learning strategies in Instance C in mitigating catastrophic forgetting for both the autoencoder (AE) component and the main predictor.
6.3. Ablation Studies / Parameter Analysis
The paper references a study [35] in which the parameterization of the CLeaR model was crucial. On highly periodic datasets, a standard model could sometimes outperform a continual learning model whose plasticity was set too high, since excessive adaptation degrades performance on purely periodic data. This indicates that the parameters governing how a CL model handles periodicity matter: with the right selection, a continual learning model can still perform comparably to a standard model on purely periodic data, while its true advantage shows in complex, dynamic scenarios like wind farm forecasting, where data shifts occur despite underlying periodicity. This implicitly constitutes a form of parameter analysis in which the appropriate balance of plasticity is critical.
7. Conclusion & Reflections
7.1. Conclusion Summary
This survey paper provides a foundational overview of continual learning (CL) applied to time series forecasting, a field that is still in its nascent stages compared to CL for classification tasks. The paper highlights that deep learning models, while powerful, face significant challenges in nonstationary environments due to data and concept drift, leading to catastrophic forgetting. Continual learning emerges as a promising solution, offering adaptability and stability by allowing models to continuously learn new information without compromising previously acquired knowledge.
The core of CL lies in addressing the plasticity/stability dilemma, and existing strategies generally fall into three categories: data rehearsal, architectural modification, and regularization. While these strategies have been extensively explored in classification, their application to time series forecasting is less developed. The paper identifies specific CL scenarios for time series (data domain incremental and target domain incremental learning) and discusses the merits of task-based versus task-free approaches. Experimental evidence from various applications, particularly in the energy domain, demonstrates CL's superiority over fine-tuning and its ability to approach joint training performance, often with lower forgetting rates.
7.2. Limitations & Future Work
The authors point out several key limitations and suggest future research directions:
- Lack of Common Benchmarks: A significant limitation is the absence of standardized datasets and evaluation protocols for continual learning in time series forecasting. This hinders direct comparison and robust evaluation of new CL models, unlike the established benchmarks in classification.
- Task Definition and Management: The concept of "tasks" in time series is often ambiguous, unlike in classification. Most current approaches use task-based arbitrary divisions, which may not reflect real-world data drifts. There is a strong need for task-free solutions where task boundaries are detected automatically or implicitly.
- Adaptation of Complex DL Models: Future work should focus on adapting more complex and popular deep learning architectures (like encoder-decoder systems, attention mechanisms, Transformers, and Autoencoders) to continual learning for time series.
- New Metrics Development: The paper calls for the development of new, relevant metrics beyond standard regression errors. These should include refined measures of forgetting rate, of plasticity/stability (e.g., adaptation speed of the loss function), and, crucially, of resource consumption (CPU, GPU, memory, electricity).
- Explainability: As continual learning models become more complex, ensuring their explainability and interpretability is a growing challenge and a vital area for future research, as it helps understand model operation and builds trust in predictions.
- Broader Application Domains: While some applications exist (energy, economics), CL for time series has immense untapped potential in fields like medicine (patient data, physiological signals) and environmental studies (meteorology, climate, hydrology).
7.3. Personal Insights & Critique
This survey serves as an excellent starting point for anyone interested in the intersection of continual learning and time series forecasting. Its beginner-friendly explanations of core CL concepts and strategies are highly valuable. The clear distinction between CL scenarios for classification versus time series is particularly insightful, highlighting the unique challenges of regression and temporal data.
One of the paper's strengths is its emphasis on the need for standardized benchmarks. The absence of such benchmarks is indeed a critical barrier to progress in this nascent field, as it makes it difficult to objectively compare and validate new CL techniques. The call for task-free solutions is also pragmatic, reflecting the dynamic and often unstructured nature of real-world time series data.
A potential area for improvement, though acknowledged by the authors as a limitation of the current literature, is the lack of detailed algorithmic descriptions specific to time series forecasting. While the survey outlines the general CL strategies, a more in-depth discussion or examples of how these strategies are concretely implemented within time series models (e.g., how regularization is applied to LSTM weights or how replay buffers are managed for sequences) would be beneficial, even if it meant referencing specific complex papers.
The paper's focus on resource consumption metrics is forward-thinking and crucial for the practical deployment of continual learning models, especially in edge computing or resource-constrained environments. The idea that CL could also be a source of explainability is an intriguing concept that warrants further exploration.
The methods and conclusions presented here can certainly be transferred to other domains dealing with streaming data or evolving concepts. For instance, in fraud detection where fraud patterns continuously change, or in robotics where robots need to adapt to new environments and tasks. The fundamental principles of balancing plasticity and stability and mitigating catastrophic forgetting are universally applicable to any AI system designed for lifelong learning. Overall, this paper effectively frames the current state and future directions of an important and challenging area of AI.