
Robust regression for electricity demand forecasting against cyberattacks

Published:12/10/2022

Electricity Load Forecasting, Robust Regression Methods, Cyberattack Detection, Outlier Detection, Data-Driven Parameter Tuning


This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study proposes data-driven robust regression, especially adaptive trimmed regression, to enhance outlier detection and electricity demand forecasting against cyberattacks, outperforming fixed-parameter methods and improving prediction reliability and system stability.

Abstract

International Journal of Forecasting 39 (2023) 1573–1592. Journal homepage: www.elsevier.com/locate/ijforecast

Robust regression for electricity demand forecasting against cyberattacks

Daniel VandenHeuvel (a), Jinran Wu (b, a), You-Gan Wang (b, a, ∗)
(a) Queensland University of Technology, Brisbane, Queensland 4001, Australia
(b) Australian Catholic University, Brisbane 4000, Australia

Keywords: Robust estimate, Data-driven, Outliers, Regression, Cyberattack, Data integrity

Abstract: Standard methods for forecasting electricity loads are not robust to cyberattacks on electricity demand data, potentially leading to severe consequences such as major economic loss or a system blackout. Methods are required that can handle forecasting under these conditions and detect outliers that would otherwise go unnoticed. The key challenge is to remove as many outliers as possible while maintaining enough clean data to use in the regression. In this paper we investigate robust approaches with data-driven tuning parameters, and in particular present an adapt


1. Bibliographic Information

1.1. Title

The title of the paper is: Robust regression for electricity demand forecasting against cyberattacks.

1.2. Authors

The authors of the paper are:

Daniel VandenHeuvel (a)
Jinran Wu (b,a)
You-Gan Wang (b,a)

Their affiliations are:
(a) Queensland University of Technology, Brisbane, Queensland 4001, Australia
(b) Australian Catholic University, Brisbane 4000, Australia

1.3. Journal/Conference

The paper was published in the International Journal of Forecasting. This journal is a highly reputable and influential publication in the field of forecasting, indicating that the research has undergone rigorous peer review and is recognized for its quality and relevance within the forecasting community.

1.4. Publication Year

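The page lists a publication date of 12/10/2022; the journal citation is International Journal of Forecasting 39 (2023), pages 1573–1592.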

1.5. Abstract

Standard methods for forecasting electricity loads are vulnerable to cyberattacks on demand data, which can lead to severe consequences such as significant economic losses or system blackouts. The paper addresses the need for forecasting methods that can operate robustly under such conditions and effectively detect outliers. The primary challenge lies in efficiently removing outliers while preserving enough clean data for accurate regression. This research investigates robust approaches that incorporate data-driven tuning parameters. Specifically, it introduces an adaptive trimmed regression method designed to improve outlier detection and forecasting accuracy. The paper concludes that data-driven approaches generally outperform their counterparts with fixed tuning parameters and provides recommendations for future research.

1.6. Original Source Link

The original source link provided is: /files/papers/690ffe55f205bb3597edd086/paper.pdf. This is a direct link to the PDF of the published paper.

2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper is the vulnerability of traditional electricity load forecasting methods to cyberattacks. Electricity grids are critical infrastructure, and reliable load forecasting is essential because electricity cannot be easily stored in large quantities. Inaccurate forecasts, especially those skewed by malicious data manipulation, can lead to severe consequences like major economic losses (due to oversupply) or system blackouts (due to undersupply).

The motivation stems from the increasing complexity of modern power systems, which makes them more susceptible to cyberattacks. Real-world incidents, such as the 2015 Ukraine power system attack that caused widespread blackouts, underscore this vulnerability. Existing forecasting methods, while effective under normal conditions, often lack the robustness to handle contaminated data resulting from such attacks.

The specific challenge identified is the need to develop methods that can effectively detect and remove outliers introduced by cyberattacks without discarding too much valid data, thereby maintaining forecasting accuracy. The paper's entry point is to investigate and improve upon existing robust regression techniques, specifically focusing on incorporating data-driven tuning parameters to enhance their adaptability and performance in adversarial environments.

2.2. Main Contributions / Findings

The paper makes several key contributions:

Improved Adaptive Trimmed Regression: It introduces an adaptive trimmed regression method that extends and improves upon Bacher's ALTS (Adaptive Least Trimmed Squares) algorithm. This extension incorporates a robust estimator for the variance of the data and a new estimator for the proportion of outliers, directly considering the distribution of clean data.
Data-Driven Tuning Parameters: The research emphasizes and demonstrates the superior performance of data-driven tuning parameters for robust regression methods compared to their fixed-parameter counterparts. This adaptability is crucial for handling varying attack scenarios.
Enhanced Outlier Detection and Forecasting: The proposed method offers better detection of outliers and provides improved forecasts, particularly in scenarios with highly contaminated data.
Superior Performance against Cyberattacks:
- For random attacks, the new method consistently outperforms Bacher's ALTS and other M-estimation methods (both fixed and data-driven parameters), especially when the proportion of outliers is high.
- For ramp attacks, the new method also generally surpasses Bacher's ALTS. While the data-driven bisquare method shows superiority for smaller-scale ramp attacks or those with large attack windows, the proposed method remains optimal for larger-scale contaminations.
Robustness in Clean Data Scenarios: The method maintains competitive performance even when there are no outliers ( $p=1$ ), suggesting it can be a reliable default for forecasting, reducing concerns about overestimating demand when robust methods might not seem necessary.
Addressing Limitations of Prior Work: The new method addresses specific shortcomings of Bacher's ALTS, such as its tendency to underestimate the proportion of clean data $p$ and its use of a non-robust variance estimator.

The findings demonstrate that the proposed adaptive trimmed regression method significantly enhances the robustness and accuracy of electricity demand forecasting in the presence of cyberattacks, offering a more reliable tool for energy system operators.

3.1. Foundational Concepts

To fully grasp the contributions of this paper, a reader should be familiar with several foundational concepts:

Electricity Load Forecasting: This refers to predicting future electricity demand. It's critical for power system operation, ensuring a balance between supply and demand, and for economic decision-making. Key challenges include the non-storability of electricity in large quantities and the volatile nature of demand (influenced by weather, time of day, day of week, etc.).
Cyberattacks on Power Systems: These are malicious attempts to disrupt, manipulate, or gain unauthorized access to energy infrastructure systems. In the context of this paper, the focus is on data integrity attacks, where the integrity of demand data is compromised, leading to false readings and erroneous forecasts. Examples like the Ukraine power system attack highlight the real-world impact of such attacks. The paper specifically discusses random attacks (randomly scaling data points) and ramp attacks (gradually increasing/decreasing load over an interval).
Robust Statistics / Robust Regression: In classical statistics, methods like Ordinary Least Squares (OLS) are highly sensitive to outliers (data points that significantly deviate from the majority of the data). Robust statistics aims to develop methods that are less affected by such deviations. Robust regression specifically focuses on fitting regression models that are resistant to outliers in the response variable ( $y$ ) or in the explanatory variables ( $\mathbf{x}$ ). This is crucial in adversarial settings like cyberattacks, where outliers are intentionally introduced.
M-estimators: These are a class of robust estimators that generalize the concept of Maximum Likelihood Estimators (MLEs). Instead of minimizing the sum of squared residuals (as in OLS) or maximizing a likelihood function based on a Gaussian assumption, M-estimators minimize a sum of a less rapidly increasing function of the residuals. They are defined by a weight function ( $\psi$ ) that downweights the influence of large residuals.
- Huber's weight function ( $\psi_H$ ): Linearly penalizes small residuals and applies a constant penalty to residuals exceeding a certain threshold ( $k$ ). This means it treats moderately large residuals like OLS but limits the influence of extremely large ones.
- Bisquare weight function ( $\psi_B$ ): Applies a decreasing weight to larger residuals, eventually giving zero weight to residuals beyond a certain threshold ( $c$ ). This effectively removes extreme outliers from the estimation process.
Least Trimmed Squares (LTS): An LTS estimator is a highly robust regression method that works by considering only a subset of the data points, specifically those with the smallest squared residuals. It aims to find the regression coefficients that minimize the sum of the smallest $h$ squared residuals, where $h$ is a pre-defined number of "clean" observations. This implicitly discards the $n-h$ largest residuals, which are assumed to be outliers.
Order Statistics: These are the elements of a random sample arranged in ascending order. For a set of residuals $|e_1|, \dots, |e_n|$ , the ith order statistic $|e|_{(i)}$ is the $i$ -th smallest absolute residual. They are fundamental to methods like LTS and the proposed adaptive method, which rely on identifying and trimming data based on residual magnitudes.
Mean Absolute Percentage Error (MAPE): A common metric used to evaluate the accuracy of forecasts. It measures the average magnitude of error in terms of percentage.
Normal Distribution ( $\mathcal{N}(\mu, \sigma^2)$ ) and Folded Normal Distribution ( $\mathcal{FN}(\mu, \sigma^2)$ ): The normal distribution is a symmetric, bell-shaped probability distribution widely used in statistics. A folded normal distribution arises when the absolute value of a normally distributed random variable is taken. The paper uses the properties of these distributions, particularly the $\mathcal{FN}(0, \sigma^2)$ distribution for absolute residuals, to derive its robust variance estimator and outlier detection mechanism.
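As a small illustration of the MAPE metric listed above, the following sketch computes it with the standard definition (mean of absolute errors relative to the actual values, in percent). The function name `mape` and the load values are hypothetical and are not taken from the paper.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Assumes no actual value is zero (true for electricity loads in practice).
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Illustrative loads in MW (made-up numbers).
loads = [100.0, 200.0, 400.0]
preds = [110.0, 190.0, 400.0]
error = mape(loads, preds)  # relative errors are 10%, 5% and 0%
```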

3.2. Previous Works

The paper contextualizes its work by reviewing existing methods for electricity load forecasting and robust regression, highlighting the limitations that its proposed method aims to address.

General Electricity Load Forecasting Methods:
- Hybrid methods: Combine different models for different scales (e.g., short-term and long-term forecasts) (Cho et al., 2013; Lu et al., 2021; Motamedi et al., 2012; Wan et al., 2013).
- Generalized Additive Models (GAMs): Decompose electricity load into a sum of unknown functions to handle non-linearity (Gaillard et al., 2016; Goude et al., 2013; Kanda & Veguillas, 2019). The paper notes that standard GAMs are not robust to outliers, though robust extensions exist (Correia & Abebe, 2021).
- Artificial Intelligence and Neural Network methods: Such as deep learning (Bedi & Toshniwal, 2019), support vector machines (Jiang et al., 2020; Lu et al., 2021), and neural networks (Bianchi et al., 2017; Charytoniuk & Chen, 2000; Dudek, 2016) have become popular for their ability to model complex non-linear relationships, albeit with increased computation and decreased interpretability.
Robust Forecasting for Cyberattacked Data: This is the most direct line of prior work:
- Luo et al. (2018a): Benchmarked multiple linear regression, support vector regression, artificial neural networks, and fuzzy interaction regression for attacked data. They found support vector regression to be optimal among these, but none could handle significantly large attacks.
- Luo et al. (2022): Developed a robust support vector regression method that is more robust than $L_1$ regression (Luo et al., 2019) or iteratively re-weighted least squares (IRLS), especially for large-scale cyberattacks.
- $L_1$ Regression (Least Absolute Deviations): Minimizes the sum of absolute residuals, making it more robust than OLS (which minimizes squared residuals) to outliers. It's often used as a robust starting point.
- Iteratively Re-weighted Least Squares (IRLS): A general algorithm used to solve M-estimation problems and also LTS. It iteratively re-estimates weights for each data point based on its residual, then performs a weighted least squares regression.
- Jiao et al. (2022): Applied an adaptive least trimmed squares (ALTS) method developed by Bacher et al. (2016) to robustly estimate both the proportion of attacked data and regression coefficients for electricity load forecasting under cyberattacks. This paper directly builds on Jiao et al. (2022)'s work by improving Bacher's ALTS.
- Bacher et al. (2016) Adaptive Least Trimmed Squares (ALTS): This method was a key inspiration. It iteratively estimates the proportion of uncontaminated data ( $p$ ) needed for LTS, addressing the drawback of needing an a priori estimate for $p$ . This paper identifies and aims to resolve several issues with Bacher's ALTS, such as its tendency to underestimate $p$ and the non-robustness of its variance estimator.
- Robust Time Series and State-Space Methods: Chakhchoukh et al. (2010), Wang et al. (2019b), and Zeng and Li (2021) have used these, but not specifically for cyberattacked data.
- Real-time Anomaly Detection: Luo et al. (2018b) provided robust very short-term load forecasts using real-time anomaly detection.

3.3. Technological Evolution

The evolution of electricity load forecasting has moved from simpler statistical models (like multiple linear regression) to more complex, non-linear models (GAMs, AI/NN) as computational power increased and data availability grew. Initially, the focus was primarily on improving accuracy under normal operating conditions. However, with the increasing interconnectedness and digitization of power grids, the threat of cyberattacks has emerged as a critical concern. This has driven the need for robust forecasting methods—those that can maintain accuracy even when input data is deliberately manipulated or corrupted.

This paper fits into this evolution by pushing the boundaries of robust statistical methods. While previous robust methods (like $L_1$ regression, SVR) were explored, and ALTS was applied (Jiao et al., 2022), this work refines the core ALTS algorithm itself. It moves beyond simply applying robust techniques to developing more robust and data-adaptive versions of these techniques, specifically tailored to the unique challenges of cyberattack scenarios where the nature and extent of data contamination are unknown.

3.4. Differentiation Analysis

The core differences and innovations of this paper's approach, particularly compared to the most directly related work (Jiao et al., 2022, which used Bacher's ALTS), are:

Addressing Bacher's ALTS Limitations: The paper explicitly identifies and aims to overcome four key issues with Bacher's ALTS:
- Underestimation of $p$ : Bacher's ALTS often underestimates the true proportion of clean data, potentially leading to practitioners overestimating required electricity supply and incurring financial losses. The new method provides improved estimates of $p$ that only slightly underestimate its true value.
- Non-robust Variance Estimator: The estimator for $\hat{\sigma}^2$ in Bacher's ALTS (Equation 10) is not robust when the estimated proportion of clean data exceeds the true proportion.
- Reliance on Non-robust $\hat{\sigma}^2$ for $\hat{p}$ : The condition for computing $\hat{p}$ (Equation 11) in Bacher's ALTS relies on this non-robust $\hat{\sigma}^2$ .
- Biased $\hat{\sigma}^2$ : The estimate for $\hat{\sigma}^2$ in Bacher's ALTS is biased to be too small and inconsistent.
Novel Robust Variance Estimator: The new method introduces a modified $\hat{\sigma}^2$ estimator (Equation 15) that is more robust, as it uses only the smallest 25% of residuals ( $\lfloor n/4 \rfloor$ ) to ensure only clean data are used, based on the asymptotic distribution of order statistics.
Improved Outlier Proportion Estimation: The new method's computation of $\hat{p}$ (Algorithm 1, line 8) is based on finding where the $s_i^2$ statistic deviates from its asymptotic mean (Equation 12), making it less reliant on the accuracy of $\hat{\sigma}^2$ and more robust against outliers. It also incorporates a tuning parameter $q$ for added flexibility.
Emphasis on Data-Driven Tuning Parameters: While Jiao et al. (2022) primarily used Bacher's ALTS (which adaptively estimates $p$ ), their other comparisons (Huber, Bisquare) often used fixed tuning parameters. This paper explicitly demonstrates and advocates for the superior performance of all data-driven methods (including data-driven M-estimators) over their fixed-parameter counterparts.

In essence, this paper doesn't just apply an existing robust method; it meticulously dissects and improves the underlying mechanism of a prominent adaptive robust method (ALTS) and systematically compares it with a wider array of data-driven robust alternatives, providing a more comprehensive and effective solution for electricity demand forecasting under cyberattacks.

4. Methodology

4.1. Principles

The core principle behind this paper's methodology is to enhance the robustness of regression models, particularly for electricity demand forecasting, against data contamination arising from cyberattacks. This is achieved by developing a more refined Adaptive Least Trimmed Squares (ALTS) method. The intuition is that instead of relying on fixed, pre-determined parameters for outlier detection and trimming (which may not adapt well to unknown attack magnitudes), the model should adaptively learn the proportion of clean data and the true variance of the clean data directly from the input.

The theoretical basis of the new method stems from statistical properties of order statistics of residuals. Assuming that clean data residuals follow a normal distribution, their absolute values will follow a folded normal distribution. By analyzing the asymptotic distribution of the sum of squared order statistics of these absolute residuals ( $s_i^2$ ), the method can identify the point at which residuals start to deviate from this expected clean distribution, thus signaling the presence of outliers. This statistical insight allows for a more principled and data-driven approach to estimate the proportion of clean data ( $p$ ) and the variance ( $\sigma^2$ ) of the uncontaminated residuals.

4.2. Core Methodology In-depth (Layer by Layer)

The paper begins by establishing the context of multiple linear regression, which is the foundation for the forecasting model: $ y_i = \mathbf{x}_i^\top \pmb{\beta} + \sigma \varepsilon_i, \quad i = 1, \ldots, n, $ where $\mathbf{x}_i = (1, x_{i1}, \dots, x_{i,k-1})^\top$ is the vector of $k$ predictors for the $i$ -th observation, $\pmb{\beta} = (\beta_1, \ldots, \beta_k)^\top$ is the vector of regression coefficients, $\sigma^2 = \mathbb{V}(y_i) / \mathbb{V}(\varepsilon_i)$ scales the residuals, and $\varepsilon_i$ are independent and identically distributed errors. The goal of robust regression is to estimate $\pmb{\beta}$ even when some observations $y_i$ are contaminated.

4.2.1. M-estimation

M-estimators are a class of robust estimators that generalize maximum likelihood estimators. They aim to find the regression coefficients $\hat{\pmb{\beta}}$ by solving a minimization problem related to a chosen loss function or, equivalently, a weight function $\psi$ .

The M-estimator for $\hat{\pmb{\beta}}$ is defined as the solution to the following equation: $ \sum_{i=1}^n \mathbf{x}_i \psi \left( \frac{y_i - \mathbf{x}_i^\top \pmb{\beta}}{\hat{\sigma}} \right) = \mathbf{0}, $ where:

$n$ is the number of data points.
$\mathbf{x}_i$ is the vector of predictors for the $i$ -th observation.
$\psi$ is a suitably defined weight function that downweights the influence of large residuals.
$y_i - \mathbf{x}_i^\top \pmb{\beta}$ represents the residual for the $i$ -th observation.
$\hat{\sigma}$ is an estimate for the scale (standard deviation) of the residuals.
$\mathbf{0}$ is a vector of zeros.

Two commonly used weight functions are Huber's and bisquare functions:
Huber's weight function: $ \psi_H(u) = \left\{ \begin{array}{ll} u & |u| \leq k, \\ \operatorname{sgn}(u) k & |u| > k, \end{array} \right. $ where $u$ is the standardized residual $(y_i - \mathbf{x}_i^\top \pmb{\beta}) / \hat{\sigma}$ , $k$ is a hyperparameter (typically 1.345), and $\operatorname{sgn}(u)$ is the sign function. Huber's function treats small residuals like ordinary least squares (OLS) but limits the influence of larger residuals by capping their value at $\pm k$ .
Bisquare weight function: $ \psi_B(u) = \left\{ \begin{array}{ll} u \left[ 1 - \left( \frac{u}{c} \right)^2 \right]^2 & |u| \leq c, \\ 0 & |u| > c, \end{array} \right. $ where $c$ is a hyperparameter (typically 4.685). The bisquare function downweights larger residuals more aggressively, giving zero weight to residuals exceeding $\pm c$ , effectively discarding them.
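The two functions above translate directly into code. The sketch below is only an illustration of the definitions, with the typical defaults quoted in the text; the function names `psi_huber` and `psi_bisquare` are hypothetical.

```python
import math


def psi_huber(u, k=1.345):
    """Huber psi: identity on [-k, k], clipped to +/-k outside."""
    return u if abs(u) <= k else math.copysign(k, u)


def psi_bisquare(u, c=4.685):
    """Bisquare psi: u * (1 - (u/c)^2)^2 on [-c, c], zero outside."""
    if abs(u) > c:
        return 0.0
    return u * (1.0 - (u / c) ** 2) ** 2


# A moderate residual passes through Huber unchanged; a large one is capped.
small, large = psi_huber(0.5), psi_huber(-3.0)
# A residual beyond c contributes nothing under the bisquare function.
rejected = psi_bisquare(10.0)
```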

The solution to these equations is typically found using Iteratively Re-weighted Least Squares (IRLS), which involves repeatedly fitting weighted least squares models where weights are updated based on the current residuals.
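To make the IRLS idea concrete, here is a minimal sketch for Huber regression. It makes several simplifying assumptions that are not from the paper: a single predictor plus intercept (so weighted least squares has a closed form), an OLS starting point rather than a robust one, a fixed number of iterations, and the MAD-based scale $1.4826 \cdot \mathrm{median}|e_i|$ recomputed at each step. The names `wls_line` and `huber_irls` are hypothetical.

```python
import random
import statistics


def wls_line(x, y, w):
    """Weighted least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def huber_irls(x, y, k=1.345, iters=50):
    """Huber M-estimate of a line via iteratively re-weighted least squares."""
    b0, b1 = wls_line(x, y, [1.0] * len(x))  # OLS start (an assumption)
    for _ in range(iters):
        res = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
        scale = 1.4826 * statistics.median(abs(r) for r in res)
        # IRLS weight w(u) = psi_H(u) / u: 1 inside [-k, k], k / |u| outside.
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in res]
        b0, b1 = wls_line(x, y, w)
    return b0, b1


# Synthetic data: y = 2 + 3x + noise, with five observations shifted by +100.
rng = random.Random(0)
x = [float(i) for i in range(50)]
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
for i in (5, 15, 25, 35, 45):
    y[i] += 100.0

ols_b0, ols_b1 = wls_line(x, y, [1.0] * len(x))
hub_b0, hub_b1 = huber_irls(x, y)
```

In this toy setting the shifted points pull the OLS intercept upward, while the Huber fit downweights them. The paper's model has many more predictors, so a real implementation would solve the weighted normal equations in matrix form.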

4.2.2. Least Trimmed Squares (LTS)

Least Trimmed Squares (LTS) is another robust estimation approach. Instead of downweighting, it completely excludes residuals that are considered too large. The LTS estimator of $\pmb{\beta}$ is defined as: $ \hat{\pmb{\beta}} = \underset{\pmb{\beta}}{\operatorname{argmin}} \sum_{i = 1}^{\lfloor np \rfloor} |e|_{(i)}^2, $ where:

$|e|_{(i)}$ denotes the $i$ -th order statistic of the absolute residuals $\{|e_1|, \dots, |e_n|\}$ , meaning they are sorted such that $|e|_{(1)} \leq |e|_{(2)} \leq \cdots \leq |e|_{(n)}$ .
$\lfloor x \rfloor$ denotes the value of $x$ rounded down.
$p$ is the proportion of uncontaminated data (i.e., $1-p$ is the proportion of outliers).

This method is robust because it minimizes the sum of only the $\lfloor np \rfloor$ smallest squared residuals, effectively ignoring the $(n - \lfloor np \rfloor)$ largest residuals (the suspected outliers). This optimization problem can also be solved using IRLS, where the weight function $\psi(e_i)$ is: $ \psi(e_i) = \left\{ \begin{array}{ll} 1, & |e_i| \leq |e|_{(\lfloor np \rfloor)}, \\ 0, & |e_i| > |e|_{(\lfloor np \rfloor)}. \end{array} \right. $ This function assigns a weight of 1 to residuals within the trimmed subset and 0 to those outside.
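The 0/1 weighting scheme can be sketched as a loop that alternates between fitting on the currently kept observations and re-selecting the $\lfloor np \rfloor$ observations with the smallest absolute residuals. The example below assumes a single predictor, an OLS start, and a single starting point, so it only finds a local minimum of the trimmed objective; the names `ols_line` and `lts_line` and the toy data are hypothetical.

```python
def ols_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def lts_line(x, y, p=0.75, iters=50):
    """Trimmed fit: repeatedly keep the floor(n * p) smallest residuals."""
    n = len(x)
    h = int(n * p)  # floor(n * p) observations receive weight 1
    b0, b1 = ols_line(x, y)  # OLS start (an assumption)
    kept = None
    for _ in range(iters):
        res = [abs(yi - b0 - b1 * xi) for xi, yi in zip(x, y)]
        order = sorted(range(n), key=res.__getitem__)
        new_kept = sorted(order[:h])  # indices with weight 1
        if new_kept == kept:  # the kept subset is stable: stop
            break
        kept = new_kept
        b0, b1 = ols_line([x[i] for i in kept], [y[i] for i in kept])
    return b0, b1, kept


# Toy data: y = 1 + 2x with a small deterministic wobble, three points shifted.
x = [float(i) for i in range(20)]
y = [1.0 + 2.0 * xi + 0.1 * ((7 * i) % 5 - 2) for i, xi in enumerate(x)]
for i in (3, 9, 15):
    y[i] += 50.0

b0, b1, kept = lts_line(x, y, p=0.75)
```

Here $p$ is fixed in advance, which is exactly the limitation that the adaptive methods in the following subsections try to remove.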

4.2.3. Data-driven Methods for Tuning Parameters

Both M-estimation and LTS require tuning parameters ( $k$ and $c$ for M-estimators, $p$ for LTS). Data-driven methods aim to estimate these parameters from the data itself, rather than using fixed default values.

4.2.3.1. Data-driven M-estimation

For Huber and bisquare regression, $k$ and $c$ can be estimated to maximize the efficiency of the estimator. The iterative procedure proposed by Wang et al. (2007) for estimating $k$ starts by obtaining an initial estimate of $\hat{\pmb{\beta}}$ (e.g., using median regression). From its residuals, an initial scale estimate $\hat{\sigma}$ is computed: $ \hat{\sigma} = 1.4826 \, \mathrm{median} \left\{ |y_i - \mathbf{x}_i^\top \hat{\pmb{\beta}}| \right\}. $ Here, 1.4826 is a constant chosen so that $\hat{\sigma}$ is a consistent estimator of $\sigma$ under the assumption of normally distributed errors. Then, for a range of values of $k$ , the efficiency $\hat{\tau}(k)$ is evaluated: $ \hat{\tau}(k) = \left\{ \sum_{i=1}^n \mathbb{I} \left( \lvert \hat{e}_i \rvert \leq k \right) \right\}^2 \Biggl/ \left\{ n \sum_{i=1}^n \left[ \mathbb{I} \left( \lvert \hat{e}_i \rvert \leq k \right) \psi_H^2 (\hat{e}_i) + k^2 \mathbb{I} \left( \lvert \hat{e}_i \rvert > k \right) \right] \right\}, $ where $\hat{e}_i = (y_i - \mathbf{x}_i^\top \hat{\pmb{\beta}}) / \hat{\sigma}$ are the standardized residuals, and $\mathbb{I}(A)$ is an indicator function (1 if event $A$ occurs, 0 otherwise). The optimal $k$ is chosen as the value that maximizes $\hat{\tau}(k)$ . Similar ideas apply to the bisquare case using a more general non-parametric estimator for $\tau$ : $ \hat{\tau}(c) = \left\{ \sum_{i=1}^n \psi_B' (\hat{e}_i) \right\}^2 \Bigg/ \left\{ n \sum_{i=1}^n \psi_B^2 (\hat{e}_i) \right\}. $
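The efficiency criterion for Huber's $k$ is simple to evaluate on a grid. The sketch below assumes the standardized residuals $\hat{e}_i$ have already been computed; it uses the fact that $\psi_H(\hat{e}_i) = \hat{e}_i$ whenever $|\hat{e}_i| \leq k$. The function names, the residual values, and the candidate grid are all hypothetical.

```python
def huber_efficiency(std_res, k):
    """Empirical efficiency tau_hat(k) for Huber's psi.

    std_res holds standardized residuals e_i = (y_i - x_i' beta) / sigma.
    Assumes the denominator is non-zero (not all residuals are exactly 0).
    """
    n = len(std_res)
    inside = [e for e in std_res if abs(e) <= k]
    n_outside = n - len(inside)
    numerator = len(inside) ** 2
    # psi_H(e)^2 = e^2 inside [-k, k]; each outside residual contributes k^2.
    denominator = n * (sum(e * e for e in inside) + k * k * n_outside)
    return numerator / denominator


def choose_k(std_res, grid):
    """Pick the grid value of k that maximizes the empirical efficiency."""
    return max(grid, key=lambda k: huber_efficiency(std_res, k))


# Made-up standardized residuals and a two-point grid, for illustration only.
residuals = [0.5, -0.5, 1.0, -1.0]
tau_wide = huber_efficiency(residuals, 2.0)      # all four residuals inside
tau_narrow = huber_efficiency(residuals, 0.75)   # two inside, two outside
best_k = choose_k(residuals, [0.75, 2.0])
```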

4.2.3.2. Bacher's Adaptive Least Trimmed Squares (ALTS)

Bacher et al. (2016) developed an adaptive method for LTS to estimate the proportion of clean data $p$ from the data itself. The process starts with an initial estimate $\hat{p}_0 = 0.5$ . Using this $\hat{p}_0$ and an initial $\hat{\pmb{\beta}}_0$ , an LTS estimator $\hat{\pmb{\beta}}_1$ is found, yielding residuals. From these residuals, an initial estimate of the standard deviation of the clean data, $\hat{\sigma}_0$ , is obtained using: $ \hat{\sigma}_0 = \frac{1}{\Phi^{-1}(3/4)} |e|_{(\lfloor \hat{p}_0 n \rfloor)}, $ where $\Phi^{-1}$ is the standard normal quantile function. The constant $1/\Phi^{-1}(3/4) \approx 1.4826$ is the same scaling factor that makes the median absolute deviation (MAD) a consistent scale estimate for normally distributed data. A new estimate for $p$ , $\hat{p}_1$ , is then computed: $ \hat{p}_1 = \frac{1}{n} \max \left\{ i \in \{1, \dots, n\}: s_i^2 \leq \hat{\sigma}_0^2 \right\}, $ where $s_i^2 = \sum_{j=1}^i |e|_{(j)}^2 / i$ is the average of the $i$ smallest squared absolute residuals. The algorithm then iteratively computes $\hat{p}_j$ and $\hat{\sigma}_j^2$ for $j = 1, 2, \dots$ using: $ \hat{\sigma}_j^2 = \frac{1}{\lfloor n \hat{p}_j \rfloor} \sum_{i=1}^{\lfloor n \hat{p}_j \rfloor} |e|_{(i)}^2, \qquad \hat{p}_{j+1} = \frac{1}{n} \max \left\{ i \in \{1, \dots, n\}: s_i^2 \leq \hat{\sigma}_j^2 \right\}, $ terminating when $\lfloor n \hat{p}_j \rfloor = \lfloor n \hat{p}_{j+1} \rfloor$ or when the relative change in $\hat{p}$ is below a tolerance $\tau$ .
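The update rules for $\hat{\sigma}^2$ and $\hat{p}$ can be sketched on a fixed vector of residuals. This is a simplification: it assumes the residuals are held constant between updates, whereas the method described above obtains them from an LTS fit. The function name `bacher_p`, the stopping rule on the integer count, and the synthetic residuals are hypothetical, and the sketch assumes at least two residuals.

```python
from statistics import NormalDist


def bacher_p(residuals, max_iter=100):
    """Bacher-style updates of (p, sigma^2) on a fixed residual vector."""
    n = len(residuals)
    sq = sorted(r * r for r in residuals)  # squared order statistics
    s2, total = [], 0.0
    for i, value in enumerate(sq, start=1):
        total += value
        s2.append(total / i)  # s_i^2: mean of the i smallest squares

    # sigma_0^2 from the floor(0.5 * n)-th ordered absolute residual.
    z = NormalDist().inv_cdf(0.75)
    sigma2 = sq[int(0.5 * n) - 1] / z ** 2

    h_prev = None
    for _ in range(max_iter):
        # Largest i whose trimmed mean square does not exceed sigma^2.
        h = max(i for i in range(1, n + 1) if s2[i - 1] <= sigma2)
        if h == h_prev:  # the integer count stopped changing
            break
        h_prev = h
        sigma2 = s2[h - 1]  # variance update from the h smallest residuals
    return h / n, sigma2


# 80 normal-quantile "clean" residuals plus 20 gross outliers of size 10.
nd = NormalDist()
clean = [nd.inv_cdf((i + 0.5) / 80) for i in range(80)]
p_hat, sigma2_hat = bacher_p(clean + [10.0] * 20)
```

With well-separated outliers like these the update settles quickly; the difficulties discussed in the text concern less clear-cut contamination.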

4.2.4. Extensions to Bacher's ALTS (The Proposed Method)

The paper proposes improvements to Bacher's ALTS to address its identified limitations, primarily its tendency to underestimate the proportion of clean data $p$ and the non-robustness of its variance estimator.

The improvements are based on analyzing the asymptotic distribution of the order statistics of the residuals. Assuming that model errors are independent and identically distributed (i.i.d.) from a $\mathcal{N}(0, \sigma^2)$ distribution, their absolute values $|e_i|$ follow a folded normal distribution $\mathcal{FN}(0, \sigma^2)$ . The paper states the asymptotic distribution of the statistic $s_i^2 = \frac{1}{i}\sum_{j=1}^i |e|_{(j)}^2$ as $n \to \infty$ : $ \sqrt{n}\left(s_i^2 - \frac{1}{i}\sum_{j=1}^i \xi_j^2\right) \bigg/ \left(\frac{2}{i}\sqrt{\pmb{\xi}_i^\top \pmb{\Sigma}_{(i)} \pmb{\xi}_i}\right) \overset{\mathrm{d}}{\longrightarrow} \mathcal{N}(0, 1), \quad i = 1, 2, \ldots, n, $ where:

$\tilde{p}_j = j / (n+1)$ is the plotting position for the $j$ -th order statistic.
$\xi_j = F^{-1}(\tilde{p}_j)$ is the $j$ -th quantile of the $\mathcal{FN}(0, \sigma^2)$ distribution.
$F^{-1}(r) = \sigma \Phi^{-1}[(1+r)/2]$ is the quantile function of the $\mathcal{FN}(0, \sigma^2)$ distribution, where $\Phi^{-1}$ is the quantile function of the standard normal distribution.
$|\mathbf{e}|_{(i)} = (|e|_{(1)}, \ldots, |e|_{(i)})^\top$ is the vector of the first $i$ ordered absolute residuals.
$\pmb{\Sigma}_{(i)}$ is the covariance matrix of the first $i$ order statistics, with entry (j, j') given by $\sigma_{(i),jj'} = \frac{\tilde{p}_j(1-\tilde{p}_{j'})}{f(\xi_j)f(\xi_{j'})}$ for $1 \le j \le j' \le i$ .
$f(r) = 2\phi(r/\sigma)/\sigma$ is the probability density function (PDF) of the $\mathcal{FN}(0, \sigma^2)$ distribution, and $\phi$ is the PDF of the standard normal distribution.

This asymptotic result (derived in Appendix A) provides the expected value $\mathbb{E}[s_i^2] \approx \frac{1}{i}\sum_{j=1}^i \xi_j^2$ under the assumption of clean, normally distributed data. Deviations from this expected value help in identifying outliers.

The paper also derives a modified $\hat{\sigma}^2$ estimator (in Appendix B), which is used to initialize and update the variance estimate. To ensure robustness, this estimator uses only a conservative fraction of the smallest residuals: $ \hat{\sigma}^2 = \sum_{i=1}^{\lfloor n/4 \rfloor} |e|_{(i)}^2 \Bigg/ \sum_{i=1}^{\lfloor n/4 \rfloor} \left[ \Phi^{-1} \left( \frac{1 + \tilde{p}_i}{2} \right) \right]^2. $ This formulation implies that only the smallest 25% of absolute residuals are used, assuming these are most likely to be clean data points, thereby making the $\hat{\sigma}^2$ estimate more robust to potential outliers in larger residuals.
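The modified estimator can be written in a few lines. The sketch below follows the formula as stated in this analysis, with plotting positions $\tilde{p}_i = i/(n+1)$ ; the function name `robust_sigma2` and the synthetic residuals are hypothetical, and the input is assumed to contain at least four residuals.

```python
from statistics import NormalDist


def robust_sigma2(residuals):
    """Variance estimate from the smallest quarter of absolute residuals."""
    n = len(residuals)
    m = n // 4  # floor(n / 4) smallest absolute residuals are used
    abs_sorted = sorted(abs(r) for r in residuals)
    nd = NormalDist()
    numerator = sum(a * a for a in abs_sorted[:m])
    # Squared standard folded-normal quantiles at positions j / (n + 1).
    denominator = sum(
        nd.inv_cdf((1.0 + j / (n + 1)) / 2.0) ** 2 for j in range(1, m + 1)
    )
    return numerator / denominator


# Synthetic residuals with sigma = 2, built from normal quantiles.
nd = NormalDist()
clean = [2.0 * nd.inv_cdf((i + 0.5) / 400) for i in range(400)]
# Inflate every residual larger than 2 in absolute value by a factor of 100.
inflated = [100.0 * r if abs(r) > 2.0 else r for r in clean]

sigma2_clean = robust_sigma2(clean)
sigma2_inflated = robust_sigma2(inflated)
```

Because only the smallest quarter of the absolute residuals enters the sums, inflating the largest residuals leaves the estimate unchanged in this example.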

The entire procedure is summarized in Algorithm 1: Adaptive trimmed least squares:

Inputs:

Data $\mathbf{X} \in \mathbb{R}^{n \times k}$ and response $\mathbf{y} \in \mathbb{R}^n$ .
An initial estimate $\hat{\pmb{\beta}}$ for regression coefficients and corresponding residuals $\hat{\pmb{\varepsilon}} = \mathbf{y} - \mathbf{X}\hat{\pmb{\beta}} = (e_1, \dots, e_n)^\top$ . (A robust method like Least Absolute Deviations (LAD) or quantile regression is recommended for this initial estimate).
A tuning parameter $q$ (default $q = 1.2$ ).
A convergence tolerance $\tau$ (default $\tau = 10^{-4}$ ).

Outputs:

An estimate for the proportion of outliers $1-p$ .
An estimate for the variance $\hat{\sigma}^2$ of the residuals without outliers.
A robust estimate for the regression coefficients, $\hat{\pmb{\beta}}$ .

Procedure (ALTS):

Initialize $\hat{\sigma}^2$ : Compute $\hat{\sigma}^2$ using the modified estimator (Equation 15) with the initial residuals, summing over the $\lfloor n/4 \rfloor$ smallest absolute residuals: $ \hat{\sigma}^2 = \sum_{j=1}^{\lfloor n/4 \rfloor} |e|_{(j)}^2 \Bigg/ \sum_{j=1}^{\lfloor n/4 \rfloor} \left[ \Phi^{-1} \left( \frac{1 + \tilde{p}_j}{2} \right) \right]^2. $
Compute Initial Weights: Compute initial weights $\bar{w}_i = \psi(e_i)$ for $i = 1, \ldots, n$ using the LTS weight function (Equation 4), based on the current set of residuals and an initial $p$ (implicitly determined by the initial $\hat{\pmb{\beta}}$ ).
Fit Model (WLS): Fit the model $\mathbf{y} = \mathbf{X}\pmb{\beta}$ using weighted least squares regression with these weights to obtain updated regression coefficients $\hat{\pmb{\beta}}$ and residuals $e_1, \ldots, e_n$ .
Iterative Refinement Loop:
- a. Update $\hat{\sigma}^2$ : Re-compute $\hat{\sigma}^2$ using the modified estimator (Equation 15) with the updated residuals, again summing over the $\lfloor n/4 \rfloor$ smallest absolute residuals.
- b. Compute $s_i^2$ and $\mathbb{E}[s_i^2]$ : For each $i = 1, \ldots, n$ , calculate $s_i^2 = \sum_{j=1}^i |e|_{(j)}^2 / i$ and its asymptotic expected value $\mathbb{E}[s_i^2]$ (from Equation 12).
- c. Estimate $p$ ( $\hat{p}_i$ ): Define a set $S = \{ i \in \{1, \dots, n\} : s_i^2 / \mathbb{E}[s_i^2] \leq q \}$ . The new estimate for the proportion of clean data is $\hat{p}_i = |S| / n$ , where $|S|$ is the number of indices in $S$ . This step identifies how many observations are consistent with the assumed clean data distribution, allowing for a tolerance $q$ .
- d. Lower Bound for $\hat{p}_i$ : If $\hat{p}_i < 0.5$ , set $\hat{p}_i = 0.5$ and break the loop. This prevents scenarios where more than half the data are considered contaminated, ensuring a minimum proportion of clean data.
- e. Re-compute Weights and Fit Model (WLS): Using the updated $\hat{p}_i$ , re-compute the weights $w_1, \ldots, w_n$ (using Equation 4) and fit the model again with weighted least squares to get new $\hat{\pmb{\beta}}$ and residuals.
- f. Check for Convergence: If the relative change in $\hat{p}_i$ (i.e., $| \hat{p}_i - \hat{p}_{i-1} | / \hat{p}_i$ ) is less than the tolerance $\tau$ , break the loop.
End Procedure.
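
Steps b to d of the loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `estimate_clean_proportion` and the argument `expected_s2` are hypothetical, and `expected_s2` stands in for $\mathbb{E}[s_i^2]$ from Equation 12, which is not reproduced in this section and must be supplied by the caller.

```python
import numpy as np

def estimate_clean_proportion(residuals, expected_s2, q=1.2, floor=0.5):
    """Steps b-d: estimate the clean-data proportion from ordered residuals.

    expected_s2[i-1] is assumed to hold E[s_i^2] (Equation 12 of the paper),
    computed elsewhere; it is a hypothetical input in this sketch.
    Returns (p_hat, hit_floor), where hit_floor signals the early break in step d.
    """
    sq_sorted = np.sort(np.abs(np.asarray(residuals, dtype=float))) ** 2
    n = sq_sorted.size
    # s_i^2: running mean of the i smallest squared absolute residuals
    s2 = np.cumsum(sq_sorted) / np.arange(1, n + 1)
    # S = {i : s_i^2 / E[s_i^2] <= q}
    in_S = s2 / np.asarray(expected_s2, dtype=float) <= q
    p_hat = in_S.sum() / n
    hit_floor = p_hat < floor
    return (floor if hit_floor else float(p_hat)), bool(hit_floor)
```

Returning the flag alongside $\hat{p}$ lets a surrounding loop decide whether to stop, mirroring the "set to 0.5 and break" rule in step d.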

The choice of the tuning parameter $q$ is crucial. Numerical experiments show it can be dependent on data structure but less on outlier proportion. A grid search or k-fold cross-validation is recommended for selecting $q$ to minimize forecasting error (e.g., MAPE). The paper uses smoothed quantile regression (He et al., 2021) with the 0.5 quantile (equivalent to $L_1$ regression) for initial coefficient estimates due to its robustness.
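
A grid search over $q$ of the kind described above might look like the sketch below. The names `select_q` and `fit_and_forecast` are hypothetical: `fit_and_forecast(q)` is assumed to fit the robust model with tolerance `q` on the training data and return forecasts for the validation period. The toy model at the end is only a stand-in so the example runs.

```python
import numpy as np

def select_q(fit_and_forecast, q_grid, y_valid):
    """Return the q in q_grid with the lowest validation MAPE, plus all scores."""
    y_valid = np.asarray(y_valid, dtype=float)
    scores = {}
    for q in q_grid:
        forecast = np.asarray(fit_and_forecast(q), dtype=float)
        scores[q] = 100.0 * np.mean(np.abs(y_valid - forecast) / np.abs(y_valid))
    best_q = min(scores, key=scores.get)
    return best_q, scores

# Toy stand-in for the real fitting routine: forecast error grows with |q - 1.2|.
y_valid = np.array([100.0, 120.0, 90.0])
toy_model = lambda q: y_valid * (1.0 + abs(q - 1.2))
best_q, scores = select_q(toy_model, [1.0, 1.1, 1.2, 1.3], y_valid)
```

For k-fold cross-validation, the same function could be called once per fold and the per-fold scores averaged before taking the minimum.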

5. Experimental Setup

5.1. Datasets

The primary dataset used for the case study is the Global Energy Forecasting Competition 2012 (GEFCom2012) data (Hong et al., 2014).

Source and Characteristics: The GEFCom2012 dataset comprises hourly load and temperature history from 20 different zones. The data spans from January 1, 2004, to June 30, 2008. The paper specifically removes rows with missing values. The load is measured in Megawatts (MW).
Training and Validation Split:
- Training Set: Data from the years 2005 and 2006.
- Validation Set: Data from the year 2007.
Forecasting Model: The paper uses the multiple linear regression model introduced by Hong (2010), which also served as the benchmark for GEFCom2012. This model includes:
- Main Effects: $L$ (linear trend, taken as the row index of the observation), $M$ (month, categorical with 12 levels), $W$ (weekday, categorical with 7 levels), $H$ (hour, categorical with 24 levels), $T$ (current temperature), $T^2$ , and $T^3$ .
- Interaction Terms: HW (Hour-Weekday), TM (Temperature-Month), $T^2M$ , $T^3M$ , TH (Temperature-Hour), $T^2H$ , and $T^3H$ .
- This model results in 285 regression parameters.
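
The count of 285 parameters can be checked with a short tally. The sketch assumes an intercept and treatment (dummy) coding, so that a categorical factor with $k$ levels contributes $k - 1$ columns; the summary does not state the coding, so this is an assumption that happens to reproduce the stated total.

```python
# Assumption: intercept plus treatment (dummy) coding, so a k-level factor adds k - 1 columns.
levels = {"M": 12, "W": 7, "H": 24}
dummies = {name: k - 1 for name, k in levels.items()}
temp_terms = 3  # T, T^2, T^3

counts = {
    "intercept": 1,
    "L (trend)": 1,
    "M": dummies["M"],
    "W": dummies["W"],
    "H": dummies["H"],
    "T, T^2, T^3": temp_terms,
    "HW": dummies["H"] * dummies["W"],
    "TM, T^2M, T^3M": temp_terms * dummies["M"],
    "TH, T^2H, T^3H": temp_terms * dummies["H"],
}
total = sum(counts.values())
print(total)  # 285
```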
Cyberattack Simulation: The dataset is subjected to simulated cyberattacks using two templates:
- Random Attacks: These attacks randomly scale a proportion of the data. For $n$ data points, $\lfloor (1-p)n \rfloor$ points are randomly selected. Their values $y_t$ are replaced with $(1 + s/100)y_t$ , where $s \sim \mathcal{N}(\mu, \sigma^2)$ is drawn independently for each attacked point.
  - Economic loss attacks: $\mu > 0$ , leading to overestimation of demand.
  - System blackout attacks: $\mu < 0$ , leading to underestimation of demand.
  - Parameters: $\mu$ and cv (coefficient of variation, $\mu/\sigma$ ).
- Ramp Attacks: These attacks manipulate data over a specific time interval, gradually increasing or decreasing the load. The data is split into intervals of length $(t_e - t_s)$ , and $\lfloor (1-p)n / (t_e - t_s) \rfloor$ of these intervals are attacked.
  - Formulation 1: $y_{t,a} = (1 + \lambda (t - t_s)) y_t$ for $t_s < t \leq \frac{1}{2}(t_s + t_e)$ . (Only this first-half branch survives in the extracted text; the second half of the window presumably reverses the ramp. At the midpoint the scaling factor reaches $1 + \lambda\ell/2$ with $\ell = t_e - t_s$ , which is the quantity $\gamma$ used in Formulation 2.)
  - Parameters: $\ell$ (length of the attack window, $t_e - t_s$ ) and $\lambda$ (scale parameter).
  - Formulation 2 (as used by Jiao et al. 2022): $\ell$ and $\gamma = 1 + \ell\lambda/2$ .
    
The following figure (Figure 1 from the original paper) provides a concrete example of the GEFCom2012 data and the applied cyberattacks:

The image is a chart showing how the electricity load data changes under different conditions: (a) clean data, (b) data under random attack, and (c) data under ramp attack. Figure 1. (a) A subset of the GEFCom2012 data. (b) Random-attacked data using (18) with $\mu = 40, \sigma = 7, p = 0.3$ . (c) Ramp-attacked data using (19) with $p = 0.3, \lambda = 0.03, t_e - t_s = 50$ .

These datasets and attack templates are effective for validating the method's performance because GEFCom2012 is a well-established benchmark in energy forecasting, and the chosen attack types (random, ramp) represent plausible malicious interventions designed to cause specific disruptions (economic loss, system blackout), directly addressing the paper's motivation.
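
The random-attack template described above can be sketched as follows. The function name `random_attack` and its signature are hypothetical; only the replacement rule $y_t \to (1 + s/100) y_t$ with $s \sim \mathcal{N}(\mu, \sigma^2)$ on $\lfloor (1-p)n \rfloor$ randomly chosen points comes from the text. The ramp template is not sketched because its formula is incomplete in this summary.

```python
import numpy as np

def random_attack(y, p, mu, sigma, rng):
    """Scale floor((1 - p) * n) randomly chosen points by (1 + s / 100), s ~ N(mu, sigma^2)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Number of attacked points; floating-point rounding in (1 - p) * n
    # can make this one lower than the exact value for some p.
    k = int(np.floor((1 - p) * n))
    idx = rng.choice(n, size=k, replace=False)   # attacked positions, no repeats
    s = rng.normal(mu, sigma, size=k)            # one independent draw per attacked point
    attacked = y.copy()
    attacked[idx] = (1 + s / 100) * y[idx]
    return attacked, idx

rng = np.random.default_rng(0)
load = np.full(10, 100.0)
attacked, idx = random_attack(load, p=0.5, mu=50, sigma=0.0, rng=rng)
```

With $\mu > 0$ the attacked loads are inflated (economic loss), and with $\mu < 0$ they are deflated (system blackout). Setting `sigma=0.0` in the usage lines is only to make the effect easy to inspect: every attacked point becomes exactly 1.5 times its original value.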

5.2. Evaluation Metrics

The primary evaluation metric used in this paper is the Mean Absolute Percentage Error (MAPE).

5.2.1. Mean Absolute Percentage Error (MAPE)

Conceptual Definition: MAPE quantifies the accuracy of a forecast by measuring the average magnitude of the errors in percentage terms. It is particularly useful when comparing the accuracy of forecasts across different time series or datasets that have different scales, as it normalizes the errors by the actual values. A lower MAPE value indicates a more accurate forecast.
Mathematical Formula: $ \mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^n \frac{|\mu_i - \hat{y}_i|}{|\mu_i|}, $
Symbol Explanation:
- $n$ : The total number of data points or observations being forecasted.
- $\mu_i$ : The actual (true) expected value of the dependent variable for the $i$ -th observation in the model (e.g., the unattacked electricity load).
- $\hat{y}_i$ : The forecasted value of the dependent variable for the $i$ -th observation.
- $|\cdot|$ : Denotes the absolute value.
- $\sum_{i=1}^n$ : Represents the sum over all $n$ observations.
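
As a small illustration of the formula, the sketch below computes MAPE with NumPy. The function name `mape` and the example numbers are made up for illustration; `mu` plays the role of the unattacked reference values $\mu_i$.

```python
import numpy as np

def mape(mu, y_hat):
    """Mean absolute percentage error, in percent, relative to the reference values mu."""
    mu = np.asarray(mu, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return 100.0 * np.mean(np.abs(mu - y_hat) / np.abs(mu))

# Two forecasts, each off by 10% of the reference value.
print(mape([100.0, 200.0], [110.0, 180.0]))  # 10.0
```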

5.3. Baselines

The paper compares its new adaptive trimmed regression method (referred to as "New") against a comprehensive set of baselines, including:

Bacher's ALTS (Jiao et al., 2022): The original Adaptive Least Trimmed Squares method, as implemented and evaluated by Jiao et al. (2022) for the same problem. This is the most direct comparison as the "New" method is an extension of Bacher's ALTS.
M-estimation Methods (Fixed Parameters):
- Bisquare: M-estimator using the bisquare weight function with a fixed, typical tuning parameter (e.g., $c = 4.685$ ).
- Huber: M-estimator using Huber's weight function with a fixed, typical tuning parameter (e.g., $k = 1.345$ ).
Data-Driven M-estimation Methods:
- Bisquare DD (Data-Driven): M-estimator using the bisquare weight function where the tuning parameter $c$ is adaptively estimated from the data (as per Wang et al., 2007, 2018; Jiang et al., 2019).
- Huber DD (Data-Driven): M-estimator using Huber's weight function where the tuning parameter $k$ is adaptively estimated from the data.
$L_1$ Regression: Also known as Least Absolute Deviations (LAD) regression, it minimizes the sum of absolute residuals. This method is inherently more robust to outliers than OLS and serves as the initial robust estimate for the proposed ALTS method. It is based on work by Luo et al. (2019).
Least Squares (LS): The standard Ordinary Least Squares (OLS) regression, which minimizes the sum of squared residuals. This is included as a non-robust benchmark to highlight the necessity of robust methods in cyberattack scenarios.

These baselines are representative because they cover a spectrum of regression techniques, ranging from non-robust (LS) to moderately robust (Huber, $L_1$ ) to highly robust (Bisquare, Bacher's ALTS), and include both fixed and data-driven parameter approaches. This allows for a thorough assessment of the proposed method's performance across different levels of robustness and adaptability. All results are averaged over five separate simulation runs.
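
For readers unfamiliar with the two fixed-parameter M-estimators, the sketch below gives the textbook Huber and bisquare weight functions with the tuning constants quoted above. The paper's own formulas are not reproduced in this section, so the standard forms are assumed here, and the input `u` is assumed to be a residual already divided by a scale estimate.

```python
import numpy as np

def huber_weight(u, k=1.345):
    """Textbook Huber weight: 1 inside [-k, k], k / |u| outside."""
    a = np.abs(np.asarray(u, dtype=float))
    # np.maximum keeps the divisor away from zero; entries with a <= k get weight 1 anyway.
    return np.where(a <= k, 1.0, k / np.maximum(a, k))

def bisquare_weight(u, c=4.685):
    """Textbook Tukey bisquare weight: (1 - (u / c)^2)^2 inside [-c, c], 0 outside."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= c, (1.0 - (u / c) ** 2) ** 2, 0.0)
```

The contrast relevant to the baselines is that Huber weights shrink but never reach zero, whereas bisquare weights discard observations beyond $c$ entirely, which is why the bisquare is usually regarded as the more aggressively robust of the two.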

6. Results & Analysis

6.1. Core Results Analysis

The experimental results consistently demonstrate the superior performance of the proposed new adaptive trimmed regression method, especially under high data contamination, when compared to Bacher's ALTS and many other robust regression techniques.

6.1.1. Simulation Study Results

The simulation study (using model (16) with $n=2000$ , $\sigma_1=0.1$ , $\sigma_2=1.3$ ) directly compares the new method with Bacher's ALTS across various proportions of clean data ( $p$ ).

The following are the results from Table 2 of the original paper:

p	Bacher's p	New p	Bacher's b	New b	Bacher's MAPE	New MAPE
0.5	0.712	0.527	0.239	0.189	2.625%	1.305%
0.55	0.723	0.564	0.202	0.174	2.185%	1.265%
0.6	0.745	0.608	0.178	0.160	1.668%	1.139%
0.65	0.772	0.653	0.161	0.149	1.468%	1.114%
0.7	0.799	0.703	0.145	0.139	1.339%	0.983%
0.75	0.829	0.749	0.133	0.130	1.045%	0.908%
0.8	0.861	0.809	0.123	0.123	1.029%	0.897%
0.85	0.894	0.857	0.116	0.116	0.938%	0.905%
0.9	0.927	0.911	0.108	0.110	0.838%	0.844%
0.95	0.961	0.960	0.103	0.105	0.672%	0.722%
1	0.994	0.993	0.098	0.100	0.754%	0.744%

Estimates of $p$ ( $\hat{p}$ ): The "New p" column shows that the proposed method's estimates are much closer to the true $p$ whenever contamination is substantial. In contrast, "Bacher's p" overestimates $p$ for every $p < 1$ , most severely when the true $p$ is low (i.e., high contamination). For instance, at $p=0.5$ , Bacher's method estimates $\hat{p}=0.712$ , while the new method estimates $\hat{p}=0.527$ . From $p = 0.95$ upward the two estimates are nearly identical. This validates the improvement in estimating the true proportion of clean data.
Estimates of $\sigma$ ( $\hat{\sigma}$ ): The "New b" column (referring to $\hat{\sigma}$ ) is closer to the true $\sigma_1 = 0.1$ of the clean data for $p \leq 0.75$ . Bacher's method (Bacher's b) overestimates $\sigma_1$ strongly at low $p$ (0.239 versus 0.189 at $p=0.5$ ), indicating its variance estimator is less robust. For $p \geq 0.8$ the two estimates differ by at most 0.002.
MAPE Values: The "New MAPE" values are lower than "Bacher's MAPE" for all $p \leq 0.85$ , with the largest gains at low $p$ (high contamination). For example, at $p=0.55$ , the new method achieves a MAPE of 1.265% compared to Bacher's 2.185%, a relative reduction of about 42%. For $p \geq 0.9$ (few outliers) the two methods differ by at most 0.05 percentage points, with Bacher's marginally ahead at $p=0.9$ and $p=0.95$ . This is a desirable property: the added robustness costs little on nearly clean data.

This simulation study provides strong evidence that the proposed method offers more accurate estimates of both the proportion of clean data and the standard deviation of clean data, leading to superior forecasting accuracy, particularly when data is heavily contaminated.

6.1.2. Case Study: Electricity Demand Forecasting

6.1.2.1. Random-Attacked Data (Economic Loss, $\mu > 0$ )

The following are the results from Table D.3 of the original paper (partial transcription):

Parameters			MAPE (%)
1-p	µ	c = σ/μ	Jiao	New	Bisq.	Bisq. DD	Hub.	Hub. DD	L1	LS
0.0	50	1/6	6.03	6.03	6.01	6.01	5.99	6.00	6.01	5.99
0.1	50	1/6	5.98	6.06	6.00	6.03	5.83	5.88	5.78	6.48
0.2	50	1/6	6.01	5.92	6.00	6.12	5.91	5.84	6.17	9.25
0.3	50	1/6	6.44	6.07	9.18	6.05	10.81	6.44	7.83	13.53
0.4	50	1/6	11.46	6.92	16.86	17.33	17.93	11.59	17.93	17.53
0.5	50	1/6	17.20	7.93	26.97	29.23	28.53	15.79	29.23	26.97
0.6	50	1/6	22.58	8.99	35.53	39.11	39.11	21.84	39.11	35.53
0.7	50	1/6	27.18	10.37	44.02	48.86	48.86	27.99	48.86	44.02
0.8	50	1/6	31.54	12.21	51.35	57.34	57.34	34.46	57.34	51.35
0.9	50	1/6	35.10	14.47	58.62	65.81	65.81	41.35	65.81	58.62

The following figure (Figure 2 from the original paper) shows the MAPE results for economic loss attacks:

The image is a chart with nine subplots showing how the MAPE of the electricity demand forecasts varies with the parameter $p$ for different methods under different parameter settings. The horizontal axis is $p$ , the vertical axis is MAPE, and line colours denote the methods; the new method outperforms the others in most cases. Figure 2. Economic loss results. MAPE results for the new method compared to Bacher's method for a range of values of $\mu$ and $\mathrm{cv} = \mu/\sigma$ , as a function of the proportion of clean data $p$ , and compared to many other methods. For the new method, $q=1.2$ is used in each plot. All the experiments use $\mu > 0$ so that we are investigating economic loss attacks. See also Table D.3.

The new method consistently demonstrates superior forecasting accuracy across all combinations of $\mu$ and cv values for economic loss attacks. This advantage is particularly pronounced when the proportion of attacked data (1-p) is high (i.e., smaller $p$ ).
$L_1$ regression is the next most competitive method, followed by Jiao et al.'s results (using Bacher's ALTS).
As $\mu$ (the average size of the attack) increases, the relative performance between the new method and Bacher's ALTS narrows slightly, but the new method generally maintains its edge.
The data-driven bisquare method (Bisq. DD) shows competitive performance, but generally not as good as the new method or $L_1$ regression for higher contamination.
MAPE values for all methods increase with increasing $\mu$ , indicating that larger magnitude attacks are harder to mitigate, but the new method shows the most resilience.

For $p=1$ (no outliers), all methods perform similarly, which is expected.

The following are the results from Table D.4 of the original paper for small $\mu$ :

Parameters				MAPE (%)
1-p	µ	c = σ/μ	Jiao	New	Bisq.	Bisq. DD	Hub.	Hub. DD	L1	LS
0.0	5	2	6.03	6.03	6.01	6.01	5.99	6.00	6.01	5.99
0.1	5	2	5.98	6.03	5.97	5.98	5.94	5.95	5.93	5.90
0.2	5	2	5.94	6.02	5.92	5.94	5.89	5.90	5.85	5.84
0.3	5	2	5.87	5.99	5.86	5.89	5.82	5.85	5.79	5.79
0.4	5	2	5.85	5.99	5.83	5.86	5.80	5.82	5.76	5.77
0.5	5	2	5.81	5.99	5.80	5.82	5.78	5.80	5.75	5.78
0.6	5	2	5.80	5.96	5.79	5.81	5.77	5.79	5.73	5.78
0.7	5	2	5.78	5.96	5.77	5.79	5.75	5.77	5.72	5.76
0.8	5	2	5.77	5.96	5.76	5.78	5.74	5.76	5.70	5.75
0.9	5	2	5.76	5.94	5.75	5.77	5.73	5.75	5.69	5.74
0.1	10	4	6.02	6.08	6.01	6.02	5.96	5.96	5.94	5.99
0.2	10	4	6.01	6.04	5.99	6.05	5.90	5.91	5.88	6.54
0.3	10	4	6.01	6.13	6.00	6.07	5.93	5.95	5.89	6.35
0.4	10	4	6.01	6.04	5.99	6.05	5.90	5.91	5.88	6.54
0.5	10	4	6.14	5.93	6.09	6.10	6.04	5.94	5.97	6.79
0.6	10	4	6.15	6.04	6.09	6.08	6.03	5.96	5.99	7.01
0.7	10	4	6.15	6.01	6.08	6.07	6.00	5.97	5.97	7.28
0.8	10	4	6.14	6.01	6.06	6.05	5.97	5.96	5.96	7.68
0.9	10	4	6.14	6.00	6.06	6.04	5.95	5.95	5.94	8.06

The following figure (Figure 3 from the original paper) shows the MAPE results for economic loss attacks with small $\mu$ :

Figure 3. Economic loss results for small $\mu$ . MAPE results for the new method compared to Bacher's method for a range of values of $\mu$ and $\mathrm{cv} = \mu/\sigma$ , as a function of the proportion of clean data $p$ , and compared to many other methods. For the new method, $q=1$ is used in each plot. All the experiments use $\mu > 0$ so that we are investigating economic loss attacks, and here we look at small $\mu$ as opposed to the larger $\mu$ values given in Fig. 2. See also Table D.4.

For small $\mu$ values (e.g., $\mu=5, 10, 15$ ) and various cv, when cv is low ( $cv=2$ ), the performance of all methods is similar. However, as cv increases (meaning larger $\sigma$ relative to $\mu$ , indicating attacks that are not drastically different from typical data points and thus harder to detect), the new method consistently remains optimal.
For large proportions of attacked data (1-p), the new method and $L_1$ regression are ideal choices.
A key observation is the sensitivity to $q$ : for these small $\mu$ cases, the new method required $q=1$ (instead of $q=1.2$ used elsewhere) to achieve optimal results, which was found via a grid search. This highlights the importance of proper $q$ selection.

6.1.2.2. Random-Attacked Data (System Blackout, $\mu < 0$ )

The following are the results from Table D.5 of the original paper (partial transcription):

Parameters			MAPE (%)
1-p	µ	cv = σ/µ	Jiao	New	Bisq.	Bisq. DD	Hub.	Hub. DD	L1	LS
0.0	-20	-1/4	6.03	6.03	6.01	6.01	5.99	6.00	6.01	5.99
0.1	-20	-1/4	6.22	6.09	6.20	6.07	6.31	6.20	6.51	6.69
0.2	-20	-1/4	6.78	6.36	6.95	6.44	7.09	6.61	7.40	7.79
0.3	-20	-1/4	8.73	6.94	8.77	8.99	8.84	9.26	8.81	9.26
0.4	-20	-1/4	10.84	7.93	10.62	10.71	10.70	11.45	11.45	11.45
0.5	-20	-1/4	13.06	9.23	12.87	12.78	12.78	13.62	13.62	13.62
0.6	-20	-1/4	15.22	10.98	15.01	15.00	15.00	15.82	15.82	15.82
0.7	-20	-1/4	17.27	13.20	17.06	17.08	17.08	18.01	18.01	18.01
0.8	-20	-1/4	19.06	15.86	18.89	18.88	18.88	19.98	19.98	19.98
0.9	-20	-1/4	20.65	18.91	20.50	20.50	20.50	21.80	21.80	21.80

The following figure (Figure 4 from the original paper) shows the MAPE results for system blackout attacks:

The image is a multi-panel line chart from the paper showing the relationship between forecast error (MAPE) and the proportion parameter $p$ for different methods under different parameter settings, comparing the robust regression methods. The panels cover different combinations of $\mu$ and cv. Figure 4. System blackout results. MAPE results for the new method compared to Bacher's method for a range of values of $\mu$ and $\mathrm{cv} = \mu/\sigma$ , as a function of the proportion of clean data $p$ , and compared to many other methods. For the new method, $q=1.2$ is used in each plot. All the experiments use $\mu < 0$ so that we are investigating system blackout attacks. See also Table D.5.

For system blackout attacks ( $\mu < 0$ ), the new method again outperforms all other baselines for almost all $p$ values, demonstrating its robust performance in minimizing forecast errors for these critical attack types.
Jiao et al.'s method (Bacher's ALTS) is generally the second-best, followed by $L_1$ regression.
The performance advantage of the new method is especially notable for large-scale attacks (high 1-p).
As $| \mu |$ increases, the results from different models become more similar, but the new method's curve remains distinct, indicating its consistent superior performance. This trend is not as evident when cv is varied.
For $p=1$ , all methods are competitive, reaffirming the new method's suitability for uncontaminated data as well.

6.1.2.3. Ramp-Attacked Data ( $\ell, \lambda$ Parametrisation)

The following are the results from Table D.6 of the original paper (partial transcription):

Parameters			MAPE (%)
1-p	$\ell$	$\lambda$	Jiao	New	Bisq.	Bisq. DD	Hub.	Hub. DD	L1	LS
0.0	40	0.05	6.03	6.03	6.01	6.01	5.99	6.00	6.01	5.99
0.1	40	0.05	5.95	6.05	5.98	6.03	5.86	5.89	5.86	6.90
0.2	40	0.05	5.93	6.02	5.96	6.07	5.96	5.90	6.11	9.10
0.3	40	0.05	6.14	6.34	5.97	6.11	7.53	6.26	7.52	13.74
0.4	40	0.05	7.57	6.03	6.61	6.93	11.45	6.76	8.90	17.53
0.5	40	0.05	9.30	6.30	7.91	8.06	14.07	8.02	11.39	24.32
0.6	40	0.05	11.58	6.67	9.46	9.71	16.29	9.44	13.91	31.42
0.7	40	0.05	13.62	7.10	11.02	11.26	18.39	10.87	16.32	38.99
0.8	40	0.05	15.75	7.55	12.44	12.75	20.35	12.33	18.66	46.40
0.9	40	0.05	17.81	8.02	13.88	14.24	22.14	13.83	20.91	53.86

The following figure (Figure 5 from the original paper) shows the MAPE results for ramp attacks (with $\ell, \lambda$ ):

The image is a multi-panel line chart comparing how the MAPE of different methods varies with $p$ in electricity demand forecasting, for different values of $\ell$ and $\lambda$ . It shows the error trends of the Jiao, Bisquare, Huber, LS and other methods alongside the proposed new method. Figure 5. Ramp-attack results. MAPE results for the new method compared to Bacher's method for a range of values of $\ell$ and $\lambda$ , as a function of the proportion of clean data $p$ , and compared to many other methods. For the new method, $q=1.2$ is used in each plot. A ramp attack is used, based on the formulation with $\ell$ and $\lambda$ . See also Table D.6.

Under ramp attacks, the new method outperforms other methods for larger-scale attacks (high 1-p).
For smaller proportions of attacked data (lower 1-p), the data-driven bisquare method often shows superior performance. This suggests that for less severe ramp attacks, the bisquare method's specific handling of outliers might be more effective. The non-data-driven bisquare also performs well in this regime.
As $\ell$ (attack window length) increases, the new method's performance starts to converge with the bisquare results, but it consistently outperforms Jiao et al.'s method.
These trends are largely independent of the scale parameter $\lambda$ .

6.1.2.4. Ramp-Attacked Data ( $\ell, \gamma$ Parametrisation)

The following are the results from Table D.7 of the original paper (partial transcription):

Parameters			MAPE (%)
1-p	$\ell$	$\gamma$	Jiao	New	Bisq.	Bisq. DD	Hub.	Hub. DD	L1	LS
0.0	100	2	6.03	6.03	6.01	6.01	5.99	6.00	6.01	5.99
0.1	100	2	5.94	6.08	5.97	6.04	5.82	5.83	5.95	9.06
0.2	100	2	6.08	6.17	6.02	6.53	6.85	6.41	7.50	13.80
0.3	100	2	6.71	7.36	6.55	6.62	8.49	7.18	8.49	15.61
0.4	100	2	8.56	7.93	7.25	7.50	11.45	8.33	10.90	23.51
0.5	100	2	10.44	9.08	8.19	8.44	14.07	9.76	13.31	31.42
0.6	100	2	12.60	10.51	9.21	9.46	16.29	11.23	15.68	39.29
0.7	100	2	14.58	12.21	10.27	10.48	18.39	12.78	18.01	47.01
0.8	100	2	16.40	14.19	11.28	11.48	20.35	14.40	20.25	54.34
0.9	100	2	18.06	16.38	12.28	12.48	22.14	16.08	22.39	61.54
0.1	200	3	5.93	6.08	5.99	6.15	5.99	5.96	6.28	17.34
0.2	200	3	6.78	6.87	6.29	9.71	8.67	7.64	9.75	33.42
0.3	200	3	9.07	10.86	7.63	12.08	13.12	11.08	12.34	30.89
0.4	200	3	11.39	13.62	9.14	14.61	16.89	13.91	15.54	46.99
0.5	200	3	13.62	16.71	10.87	17.29	19.98	16.87	18.59	59.98
0.6	200	3	15.75	20.10	12.56	19.85	22.95	20.08	21.72	72.58
0.7	200	3	17.65	23.77	14.17	22.38	25.68	23.49	24.70	84.58
0.8	200	3	19.34	27.70	15.72	24.72	28.14	27.02	27.42	95.84
0.9	200	3	20.85	31.78	17.15	26.91	30.34	30.64	29.98	106.31

The following figure (Figure 6 from the original paper) shows the MAPE results for ramp attacks (with $\ell, \gamma$ ):

The image is a multi-panel line chart from the paper comparing the robust regression methods in electricity demand forecasting. It shows how the mean absolute percentage error (MAPE) of each method varies with $p$ for different values of $\ell$ and $\gamma$ . Figure 6. Ramp-attack results. MAPE results for the new method compared to Bacher's method for a range of values of $\ell$ and $\gamma = 1 + \ell\lambda/2$ , as a function of the proportion of clean data $p$ , and compared to many other methods. For the new method, $q=1.2$ is used in each plot. A ramp attack is used, based on the formulation with $\ell$ and $\gamma$ . See also Table D.7.

When ramp attacks are parameterized using $\ell$ and $\gamma = 1 + \ell\lambda/2$ , the new method remains optimal for larger-scale attacks (high 1-p) when $\ell$ is smaller (100, 200).
However, the data-driven bisquare method quickly becomes superior for smaller-scale attacks (lower 1-p).
For very large attack windows (e.g., $\ell=300$ ), the new method is generally inferior to both data-driven and non-data-driven bisquare methods for all $p$ , although it still outperforms Jiao et al.'s results (Bacher's ALTS).
When $\gamma$ increases, the new method's performance becomes more similar to the bisquare methods.
These results highlight that the optimal robust method can depend on the specific characteristics of the cyberattack (template, scale, and window length).

6.2. Data Presentation (Tables)

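The tables are presented above in their respective sections, next to the analysis they support. Those marked "partial transcription" reproduce only part of the corresponding table in the original paper.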

6.3. Ablation Studies / Parameter Analysis

The paper conducts a parameter analysis for its tuning parameter $q$ (Algorithm 1, line 8), which controls the tolerance for identifying clean data points.

The following figure (Figure C.7 from the original paper) shows additional simulation study results, exploring the impact of $q$ values:

Figure C.7. Additional simulation study results. Results for 100 simulations on the model (16) for $q \in \{1, 1.07, \dots, 1.49\}$ . (a) Estimates $\hat{p}$ for $p$ . (b) Estimates $\hat{\sigma}$ for $\sigma$ . (c) MAPE values, using (17).

Impact on $\hat{p}$ (Figure C.7a): Low $q$ values (e.g., $q \in \{1, 1.07, \dots, 1.28\}$ ) tend to underestimate $p$ . The value $q=1.35$ (the paper's choice in this simulation setting; the case-study figures above instead use $q=1.2$ or $q=1$ ) closely matches the true $p$ values. Higher $q$ values (e.g., $q \in \{1.42, 1.49\}$ ) tend to overestimate $p$ . This plot shows that for suitable $q$ values, the new method gives much more accurate estimates for $p$ than Bacher's ALTS.
Impact on $\hat{\sigma}$ (Figure C.7b): The estimates for $\hat{\sigma}$ from the new method are consistently more accurate than Bacher's ALTS, and importantly, these estimates are largely insensitive to the choice of $q$ . This indicates the robustness of the new variance estimator.
Impact on MAPE (Figure C.7c): For small $p$ (high contamination), the new method with any $q$ performs better than Bacher's ALTS. For larger $p$ (fewer outliers), the choice of $q$ becomes more critical. Values of $q$ in the range $\{1.28, 1.35, 1.42, 1.49\}$ yield better MAPE values than Bacher's method across all $p$ . The paper's choice of $q=1.35$ is justified as it consistently leads to more accurate forecasts.

The sensitivity of $q$ implies that it is a hyperparameter that needs careful tuning. The paper notes that $q$ can depend on the data structure, but less so on the proportion of outliers. The selection process involved a grid search over a range of $q$ values, choosing the one that minimized MAPE. This highlights a practical consideration for implementing the method, suggesting the need for cross-validation or other data-driven techniques for $q$ selection in real-world scenarios.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces an improved adaptive trimmed regression method, extending Bacher's ALTS, designed for robust electricity demand forecasting in the presence of cyberattacks. By incorporating a robust variance estimator and a more principled approach to estimating the proportion of clean data, based on the asymptotic distribution of order statistics, the new method enhances outlier detection and forecasting accuracy. The results indicate that data-driven tuning parameters generally outperform fixed-parameter approaches. In particular, the proposed method outperforms Bacher's ALTS and the other M-estimation methods under most random-attack scenarios and under large-scale ramp attacks, especially for highly contaminated data, although the bisquare methods remain competitive or better for small-scale ramp attacks and very long attack windows. Even in the absence of outliers, the method provides competitive forecasts, making it a reasonable choice for general forecasting applications in critical infrastructure.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose avenues for future research:

Limited Attack Templates: The study only considered random-attack and ramp-attack templates. Future work could investigate other attack types (e.g., pulse, smooth-curve attacks) or combinations thereof, and analyze performance as a function of attack length and proportion.
Linear Regression Model: The case study was limited to a linear regression model. Integrating the proposed robust framework with more complex models, such as time series methods (e.g., state-space models), could yield greater forecast accuracy by capturing temporal correlations.
No Individual Outlier Classification: The paper focused solely on forecasting accuracy and did not delve into the classification of individual data points as outliers. While the new ALTS could potentially offer this (e.g., through the $S$ set in Algorithm 1), it was beyond the scope of the current analysis.
Single Dataset: The study used only the GEFCom2012 dataset. Testing on other datasets could reveal additional features and how results vary across different data characteristics.
Tuning Parameter $q$ Selection: The choice of the tuning parameter $q$ can be sensitive and was largely determined by grid search. Future work could develop data-driven methods for selecting $q$ (similar to how $k$ and $c$ are determined for M-estimators) or investigate how $q$ might optimally vary for ongoing forecasting tasks.
Assumptions on Residuals: The method assumes that clean data residuals are independently and identically distributed (i.i.d.) according to a normal distribution. This assumption might be restrictive if the true distribution of clean data is more complex (e.g., a mixture of distributions). Future research could explore more reliable estimation of mixture distributions.
Generalization to Regression Markets: Following Pinson et al. (2022), the model could be further generalized for energy forecasting in regression markets.

7.3. Personal Insights & Critique

This paper offers a valuable contribution to the field of robust forecasting, particularly in the context of critical infrastructure like electricity grids. The rigor in identifying and addressing the specific shortcomings of Bacher's ALTS, by delving into the asymptotic properties of order statistics, is a strong point. The detailed derivation of the modified $\sigma$ estimator and the new $p$ estimation logic showcases a deep theoretical understanding.

Key Strengths and Inspirations:

Theoretical Grounding: The derivation of the asymptotic distribution of $s_i^2$ provides a solid theoretical foundation for the outlier detection mechanism, moving beyond heuristic approaches.
Practical Relevance: The focus on cyberattacks and their real-world consequences makes this research highly relevant for power system operators and cybersecurity experts. The comparison against various attack types (economic loss, system blackout, random, ramp) demonstrates its practical utility.
Emphasis on Data-Driven Adaptability: The consistent finding that data-driven methods outperform fixed-parameter ones is a crucial takeaway for practitioners. It highlights the need for dynamic, adaptive models in unpredictable environments.
Conservative Outlier Estimation: The goal of conservatively overestimating the proportion of attacked data (to ensure enough outliers are removed, even at the cost of some clean data) is a pragmatic choice in high-stakes scenarios like power grid operation, prioritizing system stability over marginal data efficiency.
Robustness on Clean Data: The observation that the new method performs competitively even on clean data ( $p=1$ ) is significant. It implies that this robust method can be deployed as a general-purpose forecasting tool without fear of degraded performance when attacks are absent, simplifying deployment strategies.

Potential Issues and Areas for Improvement:

Dependence on $q$ Tuning: While the paper explores $q$ 's sensitivity, the reliance on a grid search for optimal $q$ selection is a practical limitation. For real-time, dynamic systems, a truly data-driven, adaptive method for $q$ selection would significantly enhance the method's autonomy and robustness.
I.I.D. Residuals Assumption: The assumption of i.i.d. normally distributed residuals for the asymptotic derivations might be idealistic for complex real-world time series data, where residuals often exhibit auto-correlation or heteroscedasticity. While robust methods generally tolerate minor deviations, significant violations could impact the theoretical guarantees. The paper briefly mentions this as a future work item, suggesting it's a known concern.
Complexity vs. Interpretability: While the method is robust, its iterative nature and reliance on asymptotic distributions increase its computational complexity and reduce direct interpretability compared to simpler models. For system operators, understanding why a particular forecast was made can be as important as its accuracy.
Generalizability to Other Domains: The method's core principles (robust regression, adaptive trimming based on order statistics) could potentially be applied to other domains susceptible to data integrity attacks, such as financial forecasting, supply chain management, or health informatics. However, the specific attack templates and the domain's data characteristics would need careful consideration.

Overall, this paper provides a valuable step forward in making critical infrastructure forecasting more resilient to adversarial conditions, building a stronger foundation for secure and reliable energy systems.
