Uncertainty-Aware Knowledge Transformers for Peer-to-Peer Energy Trading with Multi-Agent Reinforcement Learning
TL;DR Summary
This paper presents a novel peer-to-peer energy trading framework combining uncertainty-aware prediction with multi-agent reinforcement learning. The KTU model quantifies prediction uncertainty, enabling risk-informed trading decisions that improve cost-efficiency and revenue, and highlighting its potential for resilient, economically efficient energy communities.
Abstract
This paper presents a novel framework for Peer-to-Peer (P2P) energy trading that integrates uncertainty-aware prediction with multi-agent reinforcement learning (MARL), addressing a critical gap in current literature. In contrast to previous works relying on deterministic forecasts, the proposed approach employs a heteroscedastic probabilistic transformer-based prediction model called Knowledge Transformer with Uncertainty (KTU) to explicitly quantify prediction uncertainty, which is essential for robust decision-making in the stochastic environment of P2P energy trading. The KTU model leverages domain-specific features and is trained with a custom loss function that ensures reliable probabilistic forecasts and confidence intervals for each prediction. Integrating these uncertainty-aware forecasts into the MARL framework enables agents to optimize trading strategies with a clear understanding of risk and variability. Experimental results show that the uncertainty-aware Deep Q-Network (DQN) reduces energy purchase costs by up to 5.7% without P2P trading and 3.2% with P2P trading, while increasing electricity sales revenue by 6.4% and 44.7%, respectively. Additionally, peak hour grid demand is reduced by 38.8% without P2P and 45.6% with P2P. These improvements are even more pronounced when P2P trading is enabled, highlighting the synergy between advanced forecasting and market mechanisms for resilient, economically efficient energy communities.
1. Bibliographic Information
1.1. Title
Uncertainty-Aware Knowledge Transformers for Peer-to-Peer Energy Trading with Multi-Agent Reinforcement Learning
1.2. Authors
Mian Ibad Ali Shah, Enda Barrett, and Karl Mason. Affiliation: School of Computer Science, University of Galway, Ireland. ORCID for Mian Ibad Ali Shah: https://orcid.org/0009-0008-7288-9757
1.3. Journal/Conference
The paper is published as a preprint on arXiv. No venue is stated beyond arXiv itself; given the subject matter, it is likely intended for a conference or journal in energy systems, artificial intelligence, or reinforcement learning.
1.4. Publication Year
The listed publication date is 2025-07-22T17:46:28.000Z. This corresponds to the arXiv v1 preprint; the paper has been submitted to arXiv and has not yet appeared in a formally peer-reviewed venue.
1.5. Abstract
This paper introduces a novel framework for Peer-to-Peer (P2P) energy trading that integrates uncertainty-aware prediction with multi-agent reinforcement learning (MARL). Unlike previous works that rely on deterministic forecasts, the proposed approach uses a heteroscedastic probabilistic transformer-based prediction model called Knowledge Transformer with Uncertainty (KTU). The KTU model quantifies prediction uncertainty for robust decision-making in stochastic P2P energy trading environments, using domain-specific features and a custom loss function to provide reliable probabilistic forecasts and confidence intervals. By integrating these uncertainty-aware forecasts into a MARL framework, agents can optimize trading strategies with a clear understanding of risk and variability. Experimental results demonstrate that an uncertainty-aware Deep Q-Network (DQN) significantly reduces energy purchase costs (up to 5.7% without P2P, 3.2% with P2P), increases electricity sales revenue (6.4% without P2P, 44.7% with P2P), and decreases peak hour grid demand (38.8% without P2P, 45.6% with P2P). These improvements are amplified with P2P trading, emphasizing the synergy between advanced forecasting and market mechanisms for efficient and resilient energy communities.
1.6. Original Source Link
https://arxiv.org/abs/2507.16796 (Publication Status: Preprint on arXiv) PDF Link: https://arxiv.org/pdf/2507.16796v1.pdf
2. Executive Summary
2.1. Background & Motivation
The global energy landscape is rapidly shifting towards distributed energy resources (DERs), decarbonization, and decentralized energy markets, with Peer-to-Peer (P2P) energy trading emerging as a promising paradigm. P2P trading empowers prosumers (consumers who also produce energy, typically from renewable energy sources like solar PV) to directly exchange electricity, optimize local renewable energy utilization, and reduce carbon emissions.
However, a central challenge in operating P2P energy markets is the inherent uncertainty associated with renewable generation (e.g., solar, wind) and dynamic load profiles (consumer demand). This variability introduces significant risks that traditional deterministic forecasting methods fail to capture, leading to suboptimal trading and dispatch decisions. Current Reinforcement Learning (RL) and Multi-Agent Reinforcement Learning (MARL) approaches in P2P energy trading often overlook this crucial uncertainty, relying on point forecasts that do not reflect the full spectrum of possible future scenarios. This creates a critical gap in the literature: how to enable risk-informed decision-making in highly stochastic energy environments.
The problem is important because unmanaged uncertainty can undermine both the economic efficiency and system reliability of P2P energy communities, especially with high renewable energy penetration. Additionally, the growing integration of carbon markets into P2P trading requires robust systems that can align economic incentives with broader decarbonization goals.
The paper's entry point is to bridge this gap by integrating uncertainty-aware prediction directly into the MARL framework for P2P energy trading. It aims to provide prosumers with a clear understanding of future energy supply and demand variability, enabling them to make more robust and risk-sensitive trading decisions.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Novel Uncertainty-Aware Forecasting Model (KTU): It introduces the Knowledge Transformer with Uncertainty (KTU) model, a heteroscedastic probabilistic transformer-based prediction model specifically designed for P2P energy trading. The model explicitly quantifies prediction uncertainty for both load and PV generation, providing not just point estimates but also confidence intervals.
- Integration of Uncertainty into MARL: The framework seamlessly integrates these uncertainty-aware forecasts into a Multi-Agent Deep Q-Network (DQN) system, allowing MARL agents to make risk-sensitive trading decisions by understanding the variability and risks associated with future energy availability.
- Custom Loss Function and Physics-Informed Constraints: The KTU model is trained with a custom loss function that combines Gaussian negative log-likelihood with domain-specific regularization terms, ensuring reliable probabilistic forecasts. It also incorporates physics-informed constraints to ensure plausibility (e.g., no PV generation at night).
- Enhanced Reward Structure for Economic and Environmental Objectives: The reward structure for the MARL agents incorporates forecast uncertainty, tariff periods, and battery constraints, addressing both economic objectives (cost reduction, revenue generation) and environmental goals (peak tariff management, and carbon emissions indirectly through peak demand reduction).
- Automated Hyperparameter Optimization: The approach uses Optuna for automated hyperparameter optimization, ensuring optimal performance for both the forecasting and trading modules.
- Significant Performance Improvements: The experimental results demonstrate substantial improvements over traditional DQN and other baselines:
  - Reduced Energy Purchase Costs: Up to 5.7% reduction without P2P trading and 3.2% with P2P trading.
  - Increased Electricity Sales Revenue: 6.4% increase without P2P and a remarkable 44.7% increase with P2P trading.
  - Reduced Peak Hour Grid Demand: 38.8% reduction without P2P and 45.6% with P2P trading.
  - Faster Convergence: The uncertainty-aware DQN converges approximately 50% faster, requiring about 25% fewer time steps than standard DQN.
  - Improved Battery Management: More effective and community-beneficial storage utilization through anticipatory charging during high renewable generation and discharging during peak demand.

These findings collectively establish a new benchmark for resilient, efficient, and sustainable P2P energy systems, highlighting the critical role of uncertainty-aware learning for both economic and operational gains in decentralized energy contexts.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Peer-to-Peer (P2P) Energy Trading
P2P energy trading refers to a decentralized energy market model where individual energy producers (prosumers) and consumers can directly exchange electricity with each other, rather than solely relying on a central utility grid. This allows prosumers with distributed energy resources (DERs) like solar panels (PV) and battery storage to sell their surplus energy to neighbors or buy energy when needed, optimizing local energy consumption and generation. It promotes local renewable energy utilization, reduces reliance on the main grid, and can lead to cost savings and increased energy independence for communities.
3.1.2. Prosumer
A prosumer is an individual or entity that both produces and consumes energy. In the context of P2P energy trading, this typically refers to households or businesses equipped with DERs (e.g., rooftop solar panels, small wind turbines, battery storage) that can generate their own electricity, consume it, and also trade any surplus or deficit with other participants in a local energy community.
3.1.3. Distributed Energy Resources (DERs)
DERs are small-scale power generation or storage technologies that are located at or near the point of energy consumption. Examples include rooftop solar photovoltaic (PV) systems, wind turbines, battery energy storage systems (BESS), and electric vehicles (EVs) with vehicle-to-grid capabilities. DERs are key components of decentralized energy systems and enable P2P energy trading.
3.1.4. Reinforcement Learning (RL)
Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make optimal decisions by interacting with an environment. The agent performs actions in a state, receives rewards or penalties based on its actions, and transitions to a new state. The goal of the agent is to learn a policy (a mapping from states to actions) that maximizes the cumulative reward over time. RL is well-suited for sequential decision-making problems.
3.1.5. Multi-Agent Reinforcement Learning (MARL)
Multi-Agent Reinforcement Learning (MARL) extends RL to scenarios involving multiple agents interacting within the same environment. In MARL, each agent learns its own policy while considering the actions and policies of other agents. This can lead to complex dynamics, as agents might be cooperative, competitive, or a mix of both. MARL is particularly relevant for P2P energy trading, where multiple prosumers (each an agent) make independent decisions that affect the overall market.
3.1.6. Deep Q-Networks (DQN)
Deep Q-Networks (DQN) is an RL algorithm that combines Q-learning with deep neural networks. Traditional Q-learning uses a Q-table to store the maximum expected future rewards for taking a specific action in a given state. However, Q-tables become impractical for environments with large or continuous state-action spaces. DQN addresses this by using a deep neural network to approximate the Q-values, enabling RL to handle high-dimensional environments effectively. Key features of DQN include experience replay (storing and sampling past experiences) and a separate target network (a copy of the main network that is updated periodically) to stabilize training.
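To make these moving parts concrete, here is a minimal, self-contained sketch of one DQN temporal-difference update with a target network. The sizes (6 state dimensions, 5 actions, batch of 32) are illustrative placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 6 state dimensions, 5 discrete actions (not the paper's config).
q_net = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 5))
target_net = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 5))
target_net.load_state_dict(q_net.state_dict())     # target starts as a copy of the main net
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def td_update(s, a, r, s_next, done):
    """One gradient step on the squared TD error, using the frozen target network."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # q(s, a; theta)
    with torch.no_grad():                                   # target uses theta^- (no gradients)
        target = r + gamma * target_net(s_next).max(dim=1).values * (1 - done)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# A batch sampled from an experience-replay buffer would be passed in like this:
s = torch.randn(32, 6); a = torch.randint(0, 5, (32,))
r = torch.randn(32); s_next = torch.randn(32, 6); done = torch.zeros(32)
td_update(s, a, r, s_next, done)
```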
3.1.7. Transformers
Transformers are a neural network architecture introduced in 2017, initially for natural language processing, but later adapted for various tasks, including time series forecasting. Their key innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element, capturing long-range dependencies efficiently without relying on sequential processing (like Recurrent Neural Networks).
3.1.8. Probabilistic Forecasting
Unlike deterministic forecasting, which provides a single point estimate for a future value (e.g., "the load will be 10 kW"), probabilistic forecasting provides a range of possible future values along with their associated probabilities. This can be expressed as a probability distribution or as prediction intervals (e.g., "there is a 90% chance the load will be between 8 kW and 12 kW"). Probabilistic forecasts are crucial for risk-aware decision-making in uncertain environments like energy markets.
3.1.9. Heteroscedasticity
In statistics, heteroscedasticity refers to a situation where the variability (or variance) of the error term in a regression model is unequal across the range of predicted values. In probabilistic forecasting, a heteroscedastic model means that the predicted uncertainty (e.g., the width of the confidence interval) can vary depending on the input or predicted value. For example, a heteroscedastic energy forecasting model might predict a wider confidence interval for PV generation during cloudy periods (higher uncertainty) than during clear, sunny periods (lower uncertainty). This is more realistic than homoscedastic models which assume constant variance.
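A toy numerical illustration of the cloudy-versus-sunny example (all numbers are hypothetical):

```python
import numpy as np

# Hypothetical numbers: PV forecast uncertainty widens as cloud cover increases.
cloud_cover = np.linspace(0.0, 1.0, 5)        # 0 = clear sky, 1 = fully overcast
mean_pv = 10 * (1 - 0.7 * cloud_cover)        # assumed mean output in kW
sigma = 0.5 + 3.0 * cloud_cover               # heteroscedastic: sigma depends on the input
for c, mu, s in zip(cloud_cover, mean_pv, sigma):
    lo, hi = mu - 1.96 * s, mu + 1.96 * s     # 95% interval under a Gaussian assumption
    print(f"cloud={c:.2f}  mean={mu:5.2f} kW  95% interval=({lo:5.2f}, {hi:5.2f})")
```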
3.1.10. Confidence Intervals / Prediction Intervals
A confidence interval (or prediction interval in forecasting) provides a range of values within which the true value of a parameter or future observation is expected to lie with a certain probability. For example, a 95% prediction interval for energy load means that there is a 95% chance that the actual load will fall within that specified range. These intervals are vital for risk management, allowing decision-makers to understand the potential variability and make more robust plans.
3.1.11. Double Auction (DA)
A double auction is a market mechanism where both buyers and sellers submit bids and offers simultaneously. Buyers submit bids indicating the maximum price they are willing to pay for a certain quantity, and sellers submit offers (asks) indicating the minimum price they are willing to accept for a certain quantity. An auctioneer then matches buyers and sellers whose bids are higher than or equal to the offers, clearing the market at an agreed-upon price. This mechanism is common in financial markets and is adapted here for P2P energy trading.
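A minimal sketch of price-priority matching in a double auction (illustrative only; clearing prices such as the ISP/IBP used later in the paper would be applied to the matched trades separately):

```python
def clear_market(buyer_book, seller_book):
    """Match orders by price priority; books hold (agent_id, quantity_kwh, price) tuples."""
    buyers = sorted(buyer_book, key=lambda b: -b[2])   # highest bid first
    sellers = sorted(seller_book, key=lambda s: s[2])  # lowest ask first
    trades, bi, si = [], 0, 0
    while bi < len(buyers) and si < len(sellers):
        buyer, b_qty, bid = buyers[bi]
        seller, s_qty, ask = sellers[si]
        if bid < ask:                                  # no compatible pairs remain
            break
        qty = min(b_qty, s_qty)
        trades.append((buyer, seller, qty))
        buyers[bi] = (buyer, b_qty - qty, bid)         # decrement remaining quantities
        sellers[si] = (seller, s_qty - qty, ask)
        if buyers[bi][1] == 0:
            bi += 1
        if sellers[si][1] == 0:
            si += 1
    return trades

# Two buyers, two sellers; only price-compatible orders are matched.
print(clear_market([("A", 3.0, 0.22), ("B", 2.0, 0.18)],
                   [("C", 4.0, 0.15), ("D", 2.0, 0.20)]))
```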
3.2. Previous Works
The paper frames its contribution by contrasting it with several types of prior research:
- Early P2P Market Mechanisms: Zhou et al. [41] showed that early community market mechanisms often applied uniform prices, which limited individualized incentives. Zheng et al. [38] introduced auction-based methods for trader-specific pricing, but these struggled with real-world uncertainties in trader behavior and energy supply.
- RL in P2P Energy Trading: May et al. [19] demonstrated MARL as a promising solution for agents to learn optimal strategies in dynamic environments. Bhavana et al. [3] identified persistent technical challenges regarding scalability and uncertainty management. Bassey et al. [2] investigated AI applications in trading strategy optimization. However, a significant limitation is that most of these implementations rely on deterministic forecasts, failing to capture inherent variability. Zhang et al. [37] explicitly showed that forecasting errors significantly impact market efficiency, underscoring the need for uncertainty-aware models.
- Transformer Architectures for Energy Forecasting: Liu et al. [17] showed promising results for Transformers in energy forecasting. However, these approaches primarily address single-agent settings or produce deterministic outputs without uncertainty quantification.
- DQN-based Approaches: Chen et al. [5] developed a DQN-based approach for price prediction, but again without uncertainty quantification.
- Uncertainty-Aware but Lacking Integration: El et al. [9] investigated uncertainty-aware prosumer coalitional games but did not integrate probabilistic forecasting with multi-agent learning. Yazdani et al. [36] proposed robust optimization for real-time trading, and Uthayansuthi et al. [32] combined clustering, forecasting, and deep reinforcement learning. Yet these approaches either lack advanced neural forecasting integration or focus primarily on economic optimization without fully considering the impact of uncertainty on MARL decision-making.
3.2.1. The Attention Mechanism (from "Attention Is All You Need" [33])
Since the paper uses a Transformer-based model, Attention is a foundational concept. The core idea behind Attention is to allow the model to weigh the importance of different parts of the input sequence when processing each element. This mechanism captures long-range dependencies without being constrained by sequential processing, which is a limitation of Recurrent Neural Networks (RNNs).
The Attention mechanism takes three inputs: a Query (Q), Keys (K), and Values (V). These are typically derived from the same input sequence in self-attention (as used in Transformers). The Query represents what we are looking for, the Keys represent what is available, and the Values are the actual information associated with the Keys.
The formula for Scaled Dot-Product Attention is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

Where:
- $Q$: the query matrix, representing the current elements for which attention is computed; its shape is typically $(n_q, d_k)$.
- $K$: the key matrix, representing all elements in the sequence that the query is compared against; its shape is typically $(n_k, d_k)$.
- $V$: the value matrix, containing the information to be extracted from the sequence, weighted by the attention scores; its shape is typically $(n_k, d_v)$.
- $d_k$: the dimension of the keys (and queries). The $\sqrt{d_k}$ scaling factor prevents the dot products from growing too large, which could push the softmax function into regions with very small gradients.
- $K^{\top}$: the transpose of the key matrix.
- $QK^{\top}/\sqrt{d_k}$: the attention scores, computed as the scaled dot product of the query with all keys; this measures the similarity between the query and each key.
- $\mathrm{softmax}(\cdot)$: normalizes the attention scores to sum to 1, effectively creating a probability distribution over the values that indicates how much attention each value should receive.
- The values are then multiplied by these normalized attention scores, yielding a weighted sum of the values. This weighted sum is the attention output, emphasizing the most relevant information from the input sequence.
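For concreteness, a minimal PyTorch sketch of this formula (not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # similarity of each query to each key
    weights = F.softmax(scores, dim=-1)            # rows sum to 1: a distribution over values
    return weights @ V                             # weighted sum of the values

# Toy self-attention: a 5-step sequence with d_k = d_v = 16, so Q = K = V.
x = torch.randn(5, 16)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([5, 16])
```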
3.3. Technological Evolution
The field has evolved from:
- Centralized Grid Management: Traditional energy systems with large power plants and a central utility managing supply and demand.
- Emergence of DERs: The rise of rooftop solar, batteries, and EVs, leading to prosumers.
- Basic P2P Concepts: Initial ideas for P2P trading, often with simple pricing or rule-based strategies.
- RL/MARL for P2P: Application of RL to optimize individual prosumer decisions, then MARL for multi-prosumer interactions. These initially relied on deterministic forecasts.
- Transformers for Forecasting: Adoption of advanced deep learning architectures like Transformers for more accurate time series forecasting.
- Uncertainty Quantification: Recognition of the limitations of deterministic forecasts and the need for probabilistic forecasting.
- Integration of Uncertainty-Aware MARL (This Paper): This paper represents a crucial step in this evolution by combining advanced probabilistic Transformer-based forecasting with MARL to enable risk-sensitive decision-making in P2P energy markets, addressing economic efficiency and environmental goals simultaneously.
3.4. Differentiation Analysis
Compared to the main methods in related work, the core differences and innovations of this paper's approach are:
- Explicit Uncertainty Quantification: Unlike most prior RL/MARL P2P trading models that rely on deterministic forecasts, this work explicitly quantifies the uncertainty in load and PV generation using a heteroscedastic probabilistic model (KTU). This is a fundamental shift from point estimates to probability distributions.
- Integration of Uncertainty into the MARL State Space: The uncertainty estimates generated by KTU are directly incorporated into the state space of the DQN agents. This allows MARL agents to make risk-sensitive decisions rather than just optimizing based on mean forecasts.
- Novel KTU Model Design: The KTU model itself is an innovation, building on Transformer architectures with dual output heads for mean and variance and a custom loss function that includes domain-specific regularization (e.g., temporal smoothness and a penalty for nighttime PV generation).
- Comprehensive Reward Function: The reward function accounts for forecast uncertainty (a confidence score), grid tariff periods (peak, non-peak), and battery constraints, aligning agent incentives with both economic and system stability goals.
- Synergy of Advanced Forecasting and Market Mechanisms: The paper demonstrates that the benefits of uncertainty-aware forecasting are amplified when combined with P2P trading mechanisms, leading to more significant improvements in cost, revenue, and peak demand reduction.
- Automated Optimization: The use of Optuna for hyperparameter tuning ensures robust and optimal performance of the KTU model.
4. Methodology
4.1. Principles
The core principle of this paper's methodology is to enable multi-agent reinforcement learning (MARL) agents in a Peer-to-Peer (P2P) energy trading environment to make risk-sensitive and anticipatory decisions by explicitly understanding the uncertainty of future energy generation and load. Instead of relying on single-point (deterministic) forecasts, the system uses probabilistic forecasts that provide both a predicted mean and an associated variance, reflecting the inherent stochasticity of renewable energy and consumer demand. This uncertainty information is then fed directly into the MARL state space, allowing Deep Q-Network (DQN) agents to optimize their trading and battery management strategies with a clearer understanding of potential risks and opportunities, ultimately leading to more robust, economically efficient, and community-beneficial outcomes.
4.2. Core Methodology In-depth (Layer by Layer)
The proposed framework, illustrated in Figure 1, integrates a Knowledge Transformer with Uncertainty (KTU) model for probabilistic forecasting with a Multi-Agent Deep Q-Network (DQN) system for P2P energy trading.
4.2.1. Data and Feature Engineering
The foundation of accurate forecasting and MARL training is robust data and feature engineering.
- P2P Community: The study considers a P2P energy trading community of 10 rural Finnish prosumers: 4 dairy farms (data from Uski et al. [31]) and 6 households (synthetic loads based on Finnish profiles and seasonal multipliers [10, 29]), 2 of which own Electric Vehicles (EVs).
- PV Generation: Photovoltaic (PV) generation is simulated using SAM [20] (System Advisor Model). Renewable capacities for PV are set to 40% of the annual load, following [30].
- Data Aggregation and Normalization: Multi-prosumer data (load, generation) are aggregated and then normalized to a common scale, which is standard practice for neural networks.
- Categorical Encoding: Prosumer size is encoded categorically, allowing the model to differentiate between types or scales of participants.
- Temporal and Seasonal Features: Cyclical encodings (sine/cosine transformations) are applied to features like hour-of-day, day-of-week, and month-of-year to capture their periodicity (e.g., [15]). For example, the hour $h$ can be encoded as $\sin(2\pi h / 24)$ and $\cos(2\pi h / 24)$. One-hot encodings are used for other discrete temporal features (see the sketch after this list).
- Custom Daylight Feature: A custom daylight feature is derived from Helsinki's astronomical data to capture high-latitude solar patterns more accurately. This feature likely indicates whether it is daytime or nighttime, and potentially the intensity of daylight.
- Supervised Learning Sequences: For the Transformer-based forecasting model, data sequences are constructed with a sliding-window approach, which is common practice for time series forecasting with Transformers [35]. This creates input-output pairs in which a sequence of past observations predicts a future sequence.
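A short sketch of the cyclical encoding and sliding-window construction described above (the window lengths are hypothetical; the paper targets a three-hour-ahead forecast):

```python
import numpy as np

def cyclical_encode(values, period):
    """Map a cyclic feature (e.g., hour-of-day) onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * np.asarray(values) / period
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

def sliding_windows(series, lookback, horizon):
    """Build (past-window, future-window) pairs for supervised forecasting."""
    X, y = [], []
    for start in range(len(series) - lookback - horizon + 1):
        X.append(series[start : start + lookback])
        y.append(series[start + lookback : start + lookback + horizon])
    return np.array(X), np.array(y)

hour_feats = cyclical_encode(np.arange(24), period=24)   # shape (24, 2)
load = np.random.rand(1000)                              # stand-in for a load series
X, y = sliding_windows(load, lookback=48, horizon=3)     # 3-step-ahead targets
```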
4.2.2. Knowledge Transformer with Uncertainty (KTU) Model
The KTU model is a heteroscedastic probabilistic transformer designed for energy forecasting. Its architecture builds on recent Transformer developments [24, 33].
The overall KTU-DQN Ensemble Architecture is depicted in Figure 1.
This image is a schematic of the KTU-DQN ensemble architecture, showing the framework used for P2P energy trading. It depicts the data sources, feature engineering, the Knowledge Transformer, probabilistic forecasting, the multi-agent DQN, and their interactions, emphasizing how the components work together for optimization and evaluation.
Figure 1. KTU-DQN Ensemble Architecture for P2P Energy Trading
The KTU model's key components and processes are:
- Input Projection Layer: This layer receives the engineered features and includes learnable positional encodings [39], which are crucial for Transformers to understand the order of elements in a sequence, as the core self-attention mechanism is permutation-invariant. It projects the input features into a higher-dimensional space (e.g., 128-dimensional).
- Transformer Encoder: The projected inputs are passed through a Transformer encoder, which uses multi-head self-attention to model complex temporal dependencies within the input sequences [40].
  - Multi-Head Self-Attention: This mechanism allows the model to jointly attend to information from different representation subspaces at different positions. The MultiHead mechanism is defined as:

    $$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$

    Where:
    - $\mathrm{MultiHead}(Q, K, V)$: the output of the Multi-Head Attention mechanism.
    - $Q$, $K$, $V$: the query, key, and value matrices, derived from the input sequence through linear transformations.
    - $h$: the number of attention heads.
    - $\mathrm{head}_i$: the output of the $i$-th attention head.
    - $\mathrm{Concat}$: a concatenation operation that joins the outputs of all attention heads along the feature dimension.
    - $W^O$: an output projection matrix that linearly transforms the concatenated head outputs back to the desired dimension.

    Each attention head $\mathrm{head}_i$ is computed with the Scaled Dot-Product Attention formula:

    $$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V)$$

    Where:
    - $W_i^Q$, $W_i^K$, and $W_i^V$: learnable projection matrices for the $i$-th head, which transform the input into different representation subspaces.
    - $\mathrm{Attention}$: the Scaled Dot-Product Attention function described in the "Prerequisite Knowledge" section.
- Dual Output Heads for Probabilistic Predictions: A distinctive feature of KTU is its dual output heads. Instead of predicting a single point estimate, these heads predict both the mean ($\mu$) and variance ($\sigma^2$) of each target variable. The variance is predicted through a Softplus activation to ensure it remains positive (variance must be non-negative). This allows KTU to capture aleatoric uncertainty [22], the inherent randomness in the data itself. The model outputs probabilistic predictions for both prosumer load ($L_{t+k}$) and photovoltaic (PV) generation ($PV_{t+k}$) as follows:

  $$P(L_{t+k} \mid x_t) = \mathcal{N}\big(\mu_L(x_t),\, \sigma_L^2(x_t)\big), \qquad P(PV_{t+k} \mid x_t) = \mathcal{N}\big(\mu_{PV}(x_t),\, \sigma_{PV}^2(x_t)\big)$$

  Where:
  - $P(L_{t+k} \mid x_t)$: the probability distribution of the load at forecast horizon $k$, given input features $x_t$.
  - $P(PV_{t+k} \mid x_t)$: the probability distribution of the PV generation at forecast horizon $k$, given input features $x_t$.
  - $\mathcal{N}$: a Normal (Gaussian) distribution.
  - $\mu_L(x_t)$ and $\sigma_L^2(x_t)$: the predicted mean and variance for load, as functions of the input features.
  - $\mu_{PV}(x_t)$ and $\sigma_{PV}^2(x_t)$: the predicted mean and variance for PV generation, as functions of the input features.
  - $x_t$: the input features available at time $t$.
  - $k$: the forecast horizon, i.e., the prediction is for $k$ time steps into the future.
- Physics-Informed Constraints: To ensure physical plausibility, especially for PV generation, the PV mean prediction is modulated by physics-informed constraints based on daylight and seasonality [14]:

  $$\mu_{PV}^{\mathrm{phys}} = \mathrm{softplus}(\tilde{\mu}_{PV}) \cdot d_t \cdot s_t$$

  Where:
  - $\mu_{PV}^{\mathrm{phys}}$: the final, physics-informed mean prediction for PV generation.
  - $\mathrm{softplus}(x) = \log(1 + e^x)$: the activation that ensures the PV prediction is non-negative.
  - $\tilde{\mu}_{PV}$: the raw mean prediction for PV from the neural network.
  - $d_t$: a binary feature (0 or 1) indicating whether it is daylight at time $t$; this ensures no PV generation is predicted at night.
  - $s_t$: a normalized daylight feature (0 to 1) accounting for seasonality (e.g., longer daylight hours in summer, shorter in winter) [14].
- Hyperparameter Optimization: The model's hyperparameters are optimized with Optuna [1], an automated hyperparameter optimization framework, targeting a three-hour-ahead joint forecast of prosumer load and PV generation. Key hyperparameters tuned include the learning rate, batch size, hidden dimensions, number of attention heads ($h$), dropout rate, and regularization weights.
- Architectural Details: The KTU architecture consists of:
  - A two-layer feedforward network projecting inputs into a 128-dimensional space, with layer normalization and ReLU activation [35].
  - Two Transformer encoder layers, each with four attention heads and 512-dimensional feedforward networks with 0.1 dropout.
  - Dual output heads with Softplus activation for variance prediction.
- Composite Loss Function: The model is optimized with a composite loss that combines the Gaussian negative log-likelihood (NLL) with domain-specific regularization terms:

  $$\mathcal{L} = \frac{1}{2} \sum_{i=1}^{N} \left[ \log(\sigma_i^2 + \epsilon) + \frac{(y_i - \mu_i)^2}{\sigma_i^2 + \epsilon} \right] + \lambda_1 \sum_{t=1}^{T-1} \left( \mu_{t+1} - \mu_t \right)^2 + \lambda_2 \sum_{t} P_t \,(1 - d_t)$$

  Where:
  - $\mathcal{L}$: the total composite loss.
  - The first term is the Gaussian Negative Log-Likelihood (NLL) loss for heteroscedastic regression. It measures how well the predicted Normal distribution (with mean $\mu_i$ and variance $\sigma_i^2$) matches the true target $y_i$; lower NLL indicates a better fit and better probabilistic calibration.
  - $N$: the number of training samples; $y_i$: the ground-truth target for sample $i$; $\mu_i$ and $\sigma_i^2$: the predicted mean and variance for sample $i$.
  - $\epsilon$: a small constant added for numerical stability, preventing division by zero or a logarithm of zero when $\sigma_i^2$ becomes very small.
  - The second term is a temporal smoothness regularizer: it penalizes large, abrupt changes in consecutive mean predictions, promoting smoother forecasts, which are often more realistic in energy systems. $\lambda_1$ controls its strength and $T$ is the sequence length of the forecast.
  - The third term is a physics-informed regularizer that penalizes physically impossible nighttime PV generation. $\lambda_2$ controls its strength, $P_t$ is the predicted PV generation at time $t$, and $d_t$ is the binary daylight feature (0 at night, 1 during the day), so $(1 - d_t)$ equals 1 at night; any $P_t > 0$ at night therefore incurs a penalty. (A minimal code sketch of the dual heads and this loss appears after this list.)
- Probabilistic Forecasting Evaluation: For evaluation, samples are drawn from the predicted mean and variance distributions to construct empirical confidence intervals. Metrics such as Prediction Interval Coverage Probability (PICP), Mean Prediction Interval Width (MPIW), and Continuous Ranked Probability Score (CRPS) are used to assess the quality and calibration of these probabilistic forecasts.
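As a minimal sketch of the dual-head output and the composite loss described above (the regularization weights `lam1`/`lam2` are hypothetical stand-ins for the tuned values):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHead(nn.Module):
    """KTU-style dual output heads: a mean and a strictly positive variance."""
    def __init__(self, d_model, n_targets):
        super().__init__()
        self.mean_head = nn.Linear(d_model, n_targets)
        self.var_head = nn.Linear(d_model, n_targets)

    def forward(self, h):
        mu = self.mean_head(h)
        var = F.softplus(self.var_head(h)) + 1e-6   # Softplus keeps the variance positive
        return mu, var

def composite_loss(mu, var, y, daylight, lam1=0.1, lam2=0.1, eps=1e-6):
    """Gaussian NLL + temporal smoothness + nighttime-PV penalty (weights are placeholders)."""
    nll = 0.5 * (torch.log(var + eps) + (y - mu) ** 2 / (var + eps)).sum()
    smooth = ((mu[1:] - mu[:-1]) ** 2).sum()                 # penalize abrupt mean changes
    night_pv = (mu.clamp(min=0) * (1 - daylight)).sum()      # positive PV predicted at night
    return nll + lam1 * smooth + lam2 * night_pv

encoded = torch.randn(24, 32)                 # e.g., Transformer-encoder output for 24 steps
mu, var = DualHead(32, 1)(encoded)
y, daylight = torch.randn(24, 1), torch.ones(24, 1)
loss = composite_loss(mu, var, y, daylight)
```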
4.2.3. Deep Q-Networks (DQN)
The DQN framework is used for the Multi-Agent Reinforcement Learning part.
- Q-function Approximation: Instead of a traditional Q-table, DQN uses a deep neural network (parameterized by $\theta$) to approximate the Q-function, which estimates the maximum expected future reward for taking a particular action in a given state.
- TD Error Minimization: The network parameters are updated via stochastic gradient descent to minimize the temporal-difference (TD) error, the difference between the current Q-value estimate and the target Q-value. Written out under the standard DQN interpretation, the update rule for the parameters is:

  $$\theta_{t+1} = \theta_t + \alpha \left( r_t + \gamma \max_{a'} q(s_{t+1}, a'; \theta^{-}) - q(s_t, a_t; \theta_t) \right) \nabla_{\theta}\, q(s_t, a_t; \theta_t)$$

  which performs gradient descent on the squared TD error $L = \left( r_t + \gamma \max_{a'} q(s_{t+1}, a'; \theta^{-}) - q(s_t, a_t; \theta_t) \right)^2$.

  Where:
  - $\theta_{t+1}$: the updated parameters of the DQN at time $t+1$; $\theta_t$: the current parameters at time $t$.
  - $\alpha$: the learning rate, controlling the step size of parameter updates.
  - $r_t$: the immediate reward received at time $t$.
  - $\gamma$: the discount factor (between 0 and 1), determining the importance of future rewards.
  - $\max_{a'} q(s_{t+1}, a'; \theta^{-})$: the maximum Q-value for the next state $s_{t+1}$ over all possible next actions $a'$, estimated by the target network with parameters $\theta^{-}$.
  - $q(s_t, a_t; \theta_t)$: the Q-value for the current state $s_t$ and action $a_t$, estimated by the main network.
  - $\theta^{-}$: the parameters of the target network, periodically synchronized with the main network; this stabilizes training by providing a more stable target for Q-value updates.
4.2.4. Pricing Model and Double Auction
The market mechanism facilitates P2P energy trading:
- Distributed P2P Energy Trading Model: The system operates with prosumers (agents) managing their own load, generation, and battery independently. They share only their electricity surplus or deficit with a centralized auctioneer, which preserves privacy because additional sensitive information is withheld.
- Centralized Auctioneer: The auctioneer acts as both a market clearer and an advisor. It evaluates market conditions and determines the Internal Selling Price (ISP) and Internal Buying Price (IBP).
- Supply and Demand Ratio (SDR): The ISP and IBP are determined using the Supply and Demand Ratio (SDR) method [16], allowing real-time price setting based on current system demand and supply.
  - ISP: the price at which prosumers can sell their surplus energy within the community.
  - IBP: the price at which prosumers can purchase energy within the community.
- Double Auction (DA) Mechanism: The paper employs a double auction mechanism, adapted from Qiu et al. [23].
  - Bids and Offers: Each buyer submits a bid with a price and quantity; each seller submits an offer (ask) with a price and quantity.
  - Order Books: The auctioneer maintains two order books, one for buy orders and one for sell orders, both sorted by price (typically buy orders from highest to lowest, sell orders from lowest to highest).
  - Market Clearing: The auctioneer clears the market by matching buyers and sellers based on their bid/offer prices and quantities, using the calculated ISP and IBP. The goal is to maximize local exchange and efficiency before resorting to transactions with the external grid.
4.2.5. Proposed Approach (Integration)
The overall framework combines the KTU probabilistic forecasts with MARL DQN agents within a P2P energy trading simulation. The process flow is shown in Figure 2.
This image is a diagram showing the process flow of the uncertainty-aware forecasting DQN simulator, including agent training, the market auction for the P2P community, and market clearing, illustrating the flow of energy transfer and data transmission. Each step and its key role are labeled.
Figure 2. Process flow of Uncertainty-aware Forecasting DQN simulator
- Simulation Environment: The system simulates 10 prosumer agents over 2 million timesteps using the PettingZoo framework (a library for MARL environments).
- Agent Autonomy: P2P participants are self-interested. The MARL setup trains agents independently to maximize their own utility, without explicit coordination. This allows each agent to adapt to the dynamic actions of others, modeling decentralized, competitive P2P energy trading.
- State Space: Each agent's state space includes:
  - Current load
  - Current generation
  - Current battery status
  - Forecasted load
  - Forecasted generation
  - Crucially, uncertainty estimates from the probabilistic forecasting model.
- Action Space: The action space consists of discrete actions representing energy management strategies:
  - Buying energy from the P2P market or grid.
  - Selling energy to the P2P market or grid.
  - Charging the battery.
  - Discharging the battery.
  - Self-consumption (using locally generated energy).
- Reward Function: The reward function for each action incorporates forecast uncertainty (via a confidence score), tariff periods, and battery constraints. These are detailed in the "Reward Functions with Uncertainty-aware Forecasting" section below.
- Market Clearing: At each timestep, agents submit bids and asks based on their needs and internal price signals. The double auction (DA) mechanism clears the market, matching trades to maximize local exchange before relying on external grid transactions.
- Dynamic Pricing: Dynamic pricing is determined by the Supply and Demand Ratio (SDR) and grid tariffs, ensuring realistic market behavior.
- Scalability: The framework is modular and parallelizable, featuring a linear-complexity auction and communication-free agents, allowing straightforward scaling to larger P2P communities. MARL training can further support larger populations through distributed or federated learning.
Algorithm 1: Uncertainty-Aware MARL for P2P Energy Trading
The step-by-step process is formalized in Algorithm 1:
Algorithm 1 Uncertainty-Aware MARL for P2P Energy Trading
1: Initialize: agent set, battery capacities, time t = 0
2: while t < T (simulation period) do
3:   if end of day then
4:     Reset hour counter and increment day
5:   end if
6:   for each agent i do
7:     Observe current load, generation, and battery SoC
8:     Get forecasts and uncertainties from the KTU model
9:     Form state vector from observations, forecasts, and uncertainties
10:    Select action using the DQN policy
11:    Calculate energy balance (generation minus load)
12:    if energy deficit or action is buy then
13:      Add (agent ID, quantity needed, bid price) to BuyerBook
14:    else if energy surplus or action is sell then
15:      Add (agent ID, quantity available, ask price) to SellerBook
16:    end if
17:  end for
18:  Market Clearing:
19:  Calculate SDR = total supply / total demand
20:  if 0 <= SDR <= 1 then
21:    Calculate ISP
22:    Calculate IBP
23:  end if
24:  Match buyers and sellers by price priority using ISP and IBP
25:  Update rewards and advance time
26: end while
Explanation of Algorithm 1:
- Initialization: The simulation starts by defining the set of agents, their battery capacities, and the initial time.
- Simulation Loop: The process continues for the total simulation period.
- End of Day Check: At the end of each day, the hour counter is reset and the day is incremented.
- Agent-Specific Actions: For each agent:
  - Observation: The agent observes its current load, generation, and battery state of charge.
  - Forecasting and Uncertainty: The KTU model provides load forecasts, generation forecasts, and their associated uncertainties.
  - State Vector Formation: These observations, forecasts, and uncertainties are combined to form the state vector. This is the critical step where uncertainty information is fed into the MARL process.
  - Action Selection: The agent's DQN policy selects an action based on the current state.
  - Energy Balance Calculation: The agent calculates its energy balance (generation minus load).
  - Order Book Submission:
    - If the agent has an energy deficit or its chosen action is to buy, it adds an entry to the BuyerBook with its ID, the absolute quantity of energy needed, and its bid price.
    - If the agent has an energy surplus or its chosen action is to sell, it adds an entry to the SellerBook with its ID, the quantity of energy available, and its ask price.
- Market Clearing: After all agents have submitted their bids and offers:
  - SDR Calculation: The Supply and Demand Ratio (SDR) is calculated as the total supply from the SellerBook divided by the total demand from the BuyerBook.
  - ISP/IBP Calculation: If the SDR is between 0 and 1 (inclusive), the Internal Selling Price (ISP) and Internal Buying Price (IBP) are calculated dynamically from the grid sell price and grid buy price, following the SDR pricing method [16]. The ISP is the dynamic internal price for selling energy within the community; the IBP is the dynamic internal price for buying energy within the community (a hedged pricing sketch follows this list).
  - Matching: Buyers and sellers are matched by price priority using the determined ISP and IBP to clear the market.
- Reward Update and Time Advance: Rewards for each agent are updated based on the outcome of their actions and the market clearing (using the functions detailed below), and the simulation time advances.
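The analysis above does not reproduce the paper's exact ISP/IBP expressions; as a hedged illustration, the SDR pricing method of Liu et al. [16] is commonly implemented as follows:

```python
def sdr_prices(supply, demand, grid_sell, grid_buy):
    """Return (ISP, IBP); falls back to grid tariffs when SDR is outside [0, 1]."""
    sdr = supply / demand if demand > 0 else float("inf")
    if 0 <= sdr <= 1:
        isp = (grid_sell * grid_buy) / ((grid_buy - grid_sell) * sdr + grid_sell)
        ibp = isp * sdr + grid_buy * (1 - sdr)
        return isp, ibp
    return grid_sell, grid_buy

# With scarce local supply (SDR = 0.4), internal prices sit between the grid tariffs.
isp, ibp = sdr_prices(supply=40.0, demand=100.0, grid_sell=0.05, grid_buy=0.20)
print(round(isp, 4), round(ibp, 4))
```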
Reward Functions with Uncertainty-aware Forecasting
The reward functions are designed to guide the DQN agents towards optimal energy management strategies, incorporating forecast uncertainty and grid tariff periods.
Notation:
- Generation for agent $i$ at time $t$.
- Load for agent $i$ at time $t$.
- State of Charge (SoC, in %) of agent $i$'s battery at time $t$.
- Grid tariff period, which can be Normal (N), Non-Peak (NP), Peak (P), or Default (D).
- Future forecasted generation for agent $i$ at horizon $h$.
- Future forecasted load for agent $i$ at horizon $h$.
- Peak deficit prediction for agent $i$, indicating whether the agent is expected to have an energy deficit during peak hours.
- Confidence score for agent $i$ at time $t$, derived from the uncertainty estimates of the KTU model; a higher confidence score corresponds to lower predicted uncertainty.

Here are the specific reward functions for different actions:
1. Charge and Buy
This action implies the agent is charging its battery and buying energy from the grid or P2P market to do so.
The reward is structured as follows:
- Highest reward when the battery is not full, the tariff period is non-peak, and a future peak deficit is predicted. This encourages charging during off-peak times to prepare for peak demand, with an extra bonus for higher confidence.
- A moderate reward when the battery is not full and the tariff period is normal.
- A baseline reward when the battery is not full and generation is less than load (i.e., there is a need for energy).
- No reward (0) when the grid tariff is peak, discouraging buying during expensive periods.
2. Buy
This action implies the agent is simply buying energy from the grid or P2P market for immediate consumption, without charging the battery.
The primary condition is that generation is less than load (the agent needs energy) and the battery SoC is very low (below 10%). When this condition holds, the reward is lower (0.25) during a peak tariff period and moderate (0.5) otherwise; no reward is given if the condition is not met.
3. Sell
This action implies the agent is selling its surplus energy (not from battery) to the grid or P2P market.
The condition is that generation exceeds load (the agent has a surplus) and the battery SoC is high (at or above 90%). When it holds, the reward is higher (0.75) during a peak tariff period (encouraging selling when prices are high) and moderate (0.5) otherwise. (Both the Buy and Sell rules are sketched in code after rule 8.)
4. Discharge and Sell
This action implies the agent is discharging its battery and selling the energy to the grid or P2P market.
The reward is structured as follows:
- Highest reward when generation exceeds load (surplus), battery SoC is at least 20%, and it is a peak tariff period. This strongly incentivizes selling stored energy during peak times, and the confidence score further boosts the reward.
- A moderate reward when generation exceeds load and battery SoC is very high (at or above 90%), even outside peak times.
5. Discharge and Buy
This action implies the agent is discharging its battery, but also buying energy, perhaps to meet immediate load or for other operational reasons. This combination seems unusual without more context but could represent a complex scenario where internal load exceeds generation even after discharging, or for specific grid stability services.
The reward applies when generation is less than load (the agent needs energy) and battery SoC is at least 10%. It is boosted by the confidence score, with an additional factor during peak tariff periods, and is moderate (0.5) otherwise. This action likely corresponds to meeting high demand from the battery during peak periods while still purchasing a small amount of energy, or to avoiding a very low SoC while still optimizing for peak conditions.
6. Self-Consumption
This action means the agent is directly consuming its locally generated energy, minimizing reliance on the grid or P2P market.
The reward is structured as follows:
- Highest reward when generation closely matches load (within 0.1 units), meaning high self-sufficiency; this is further boosted during peak tariff periods (avoiding expensive grid purchases).
- A moderate reward when the generation-load mismatch is slightly larger (between 0.1 and 0.2 units). This encourages balancing generation and load locally.
7. Self and Charge This action implies the agent is consuming its local generation, and any surplus is used to charge its battery.
The reward is structured as follows:
- Highest reward when generation exceeds load (surplus), the battery is not full, and it is a non-peak tariff period. This strongly incentivizes using local surplus to charge the battery during off-peak times, with a boost from the confidence score.
- A moderate reward when generation exceeds load and the battery is not full, regardless of tariff period.
- No reward (0) during peak tariff periods, discouraging charging when it might be more beneficial to sell or to meet peak demand.
8. Self and Discharge
This action implies the agent is consuming its local generation and also discharging its battery to meet its load.
The reward applies when generation is less than load (the agent needs energy) and battery SoC is at least 20%. It is boosted by the confidence score, with an additional boost during peak tariff periods, strongly incentivizing the use of stored energy to meet local demand at peak times; otherwise the reward is moderate (0.5).
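As a hedged sketch, the two reward rules whose magnitudes are stated explicitly above ("Buy" and "Sell") can be written as:

```python
def buy_reward(gen, load, soc, peak):
    """'Buy' rule: needs energy and battery nearly empty; cheaper off peak."""
    if gen < load and soc < 0.10:
        return 0.25 if peak else 0.5
    return 0.0

def sell_reward(gen, load, soc, peak):
    """'Sell' rule: surplus with a nearly full battery; best during peak prices."""
    if gen > load and soc >= 0.90:
        return 0.75 if peak else 0.5
    return 0.0

print(buy_reward(gen=1.0, load=2.5, soc=0.05, peak=True))   # 0.25
print(sell_reward(gen=3.0, load=1.0, soc=0.95, peak=True))  # 0.75
```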
5. Experimental Setup
5.1. Datasets
The P2P energy trading community is simulated based on real-world characteristics and synthetic data generation:
- Prosumer Community: Comprises 10 rural Finnish prosumers.
  - Dairy Farms: 4 dairy farms, using data from Uski et al. [31]; these likely have load profiles characteristic of agricultural operations.
  - Households: 6 households, with synthetic loads generated from Finnish load profiles and seasonal multipliers [10, 29], ensuring realistic daily and seasonal variations in household energy consumption.
  - Electric Vehicles (EVs): 2 of the 6 households own EVs, introducing dynamic and flexible load components due to charging patterns.
- PV Generation Data: Photovoltaic (PV) generation is simulated using the System Advisor Model (SAM) [20], a performance and financial model for renewable energy projects developed by the National Renewable Energy Laboratory (NREL). This provides realistic PV output data that accounts for local solar irradiance and weather patterns.
- Renewable Capacity: Consistent with [30], the renewable capacities (primarily PV) are set to 40% of the annual load for each prosumer, providing a significant but not overwhelming amount of local generation.
- Domain: The dataset represents a rural Finnish energy community, implying characteristics such as high-latitude solar patterns and potentially colder climates.
5.2. Evaluation Metrics
The paper evaluates the performance of the proposed framework using several key performance indicators (KPIs) and specific probabilistic forecasting metrics.
5.2.1. Key Performance Indicators (KPIs) for Trading
The primary KPIs used to assess the economic and operational benefits of the P2P trading system are:
- Electricity Purchase Cost:
  - Conceptual Definition: Quantifies the total expenditure incurred by all prosumers in the community for purchasing electricity from either the external grid or other P2P participants. Lower costs indicate better economic efficiency and optimization of energy resources.
  - Mathematical Formula: Not explicitly provided in the paper; conceptually, it is the sum of (quantity bought from grid × grid purchase price) + (quantity bought from P2P × P2P purchase price) over the simulation period for all agents.
  - Symbol Explanation: Quantity Bought refers to the amount of electricity purchased; Grid Purchase Price is the tariff for buying from the main grid; P2P Purchase Price is the price paid within the P2P market (e.g., IBP).
- Electricity Sales Revenue:
  - Conceptual Definition: Quantifies the total income generated by all prosumers from selling their surplus electricity to either the external grid or other P2P participants. Higher revenue indicates more effective utilization of local generation and better market participation.
  - Mathematical Formula: Not explicitly provided; conceptually, it is the sum of (quantity sold to grid × grid sale price) + (quantity sold to P2P × P2P sale price) over the simulation period for all agents.
  - Symbol Explanation: Quantity Sold refers to the amount of electricity sold; Grid Sale Price is the tariff for selling to the main grid; P2P Sale Price is the price received within the P2P market (e.g., ISP).
- Peak Hour Grid Demand Reduction:
  - Conceptual Definition: Measures the reduction in the maximum amount of electricity imported from the main grid during peak tariff hours. Reducing peak demand is crucial for grid stability, avoiding costly infrastructure upgrades, and often for reducing reliance on carbon-intensive peaker plants.
  - Mathematical Formula: Not explicitly provided; conceptually, it compares the maximum hourly grid import in a baseline scenario against the proposed model during specified peak hours.
  - Symbol Explanation: Peak Hour Demand refers to the highest aggregate electricity consumption from the grid during designated peak periods.
5.2.2. Probabilistic Forecasting Metrics (for KTU model)
The quality and calibration of the probabilistic forecasts generated by the KTU model are assessed using standard metrics:
- Prediction Interval Coverage Probability (PICP):
  - Conceptual Definition: PICP measures the percentage of actual observations that fall within the predicted prediction interval (e.g., a 90% confidence interval). For a well-calibrated probabilistic forecast at a 90% confidence level, the PICP should be close to 90%. If it is well below the nominal level, the intervals are too narrow (the model is overconfident); if it is well above, the intervals are too wide (the model is underconfident).
  - Mathematical Formula: $ \mathrm{PICP} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(y_i \in [\mathrm{LBI}_i, \mathrm{UBI}_i]) \times 100\% $
  - Symbol Explanation:
    - $N$: total number of observations.
    - $\mathbf{1}(\cdot)$: indicator function, 1 if the condition inside the parentheses is true and 0 otherwise.
    - $y_i$: the actual observed value for the $i$-th instance.
    - $\mathrm{LBI}_i$, $\mathrm{UBI}_i$: the lower and upper bounds of the prediction interval for the $i$-th instance.
- Mean Prediction Interval Width (MPIW):
  - Conceptual Definition: MPIW measures the average width of the prediction intervals across all forecasts. For a given PICP, a narrower MPIW is generally preferred, as it indicates more precise forecasts; however, MPIW must be considered together with PICP to avoid overly narrow intervals that do not cover enough observations.
  - Mathematical Formula: $ \mathrm{MPIW} = \frac{1}{N} \sum_{i=1}^{N} (\mathrm{UBI}_i - \mathrm{LBI}_i) $
  - Symbol Explanation:
    - $N$: total number of observations.
    - $\mathrm{UBI}_i$: the upper bound of the prediction interval for the $i$-th instance.
    - $\mathrm{LBI}_i$: the lower bound of the prediction interval for the $i$-th instance.
- Continuous Ranked Probability Score (CRPS):
  - Conceptual Definition: CRPS is a proper scoring rule that evaluates the quality of probabilistic forecasts. It measures the "distance" between the predicted cumulative distribution function (CDF) and the empirical CDF of the observation. Unlike PICP and MPIW, which evaluate a single interval, CRPS evaluates the entire forecast distribution. A lower CRPS indicates a better, more accurate, and well-calibrated probabilistic forecast.
  - Mathematical Formula: For a continuous probabilistic forecast $F$ and an observed value $y$, the CRPS is defined as: $ \mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} (F(x) - \mathbf{1}(x \ge y))^2 \, dx $. For a Gaussian forecast, it has the closed form $ \mathrm{CRPS}(\mu, \sigma, y) = \sigma \left( z\,(2\Phi(z) - 1) + 2\varphi(z) - \frac{1}{\sqrt{\pi}} \right) $, where $z = (y - \mu)/\sigma$.
  - Symbol Explanation:
    - $F(x)$: the cumulative distribution function (CDF) of the probabilistic forecast.
    - $y$: the actual observed value.
    - $\mathbf{1}(x \ge y)$: indicator function forming the empirical CDF of the observation.
    - $\mu$: the mean of the predicted Normal distribution.
    - $\sigma$: the standard deviation of the predicted Normal distribution.
    - $\Phi(\cdot)$ and $\varphi(\cdot)$: the CDF and PDF of the standard Normal distribution.
    - $z$: the standardized score (z-score) of the observation.

(A short computational sketch of all three metrics follows this list.)
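For concreteness, the three metrics can be computed as follows (illustrative sketch, assuming Gaussian forecasts):

```python
import numpy as np
from scipy.stats import norm

def picp(y, lb, ub):
    """Percentage of observations falling inside their prediction intervals."""
    return np.mean((y >= lb) & (y <= ub)) * 100

def mpiw(lb, ub):
    """Average width of the prediction intervals."""
    return np.mean(ub - lb)

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS for Gaussian forecasts, averaged over observations."""
    z = (y - mu) / sigma
    return np.mean(sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi)))

mu, sigma = np.zeros(100), np.ones(100)
y = np.random.normal(size=100)
lb, ub = mu - 1.645 * sigma, mu + 1.645 * sigma   # nominal 90% Gaussian interval
print(picp(y, lb, ub), mpiw(lb, ub), crps_gaussian(mu, sigma, y))
```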
5.3. Baselines
The proposed uncertainty-aware DQN model is compared against several baselines to demonstrate its superior performance:
- Rule Based: A simple rule-based approach, likely employing predefined heuristics for energy management (e.g., charge when PV is high, discharge when load is high). These rules are described in previous work [27, 25].
- RB+QL (Rule Based + Q-Learning Ensemble): An ensemble approach combining rule-based strategies with Q-learning, which suggests that Q-learning is used to refine decisions within a rule-based framework or to learn when to activate certain rules. These algorithms are described in [27, 25].
- DQN (Standard Deep Q-Network): A baseline DQN model that does not incorporate uncertainty-aware forecasts and relies on deterministic (point) forecasts for load and generation. This comparison highlights the specific benefit of integrating uncertainty information.
- DQN Forecasting (this paper's proposed model): The DQN model that integrates the uncertainty-aware forecasts from the KTU model. The paper refers to it as "DQN Forecasting" to distinguish it from the standard DQN.
- Other MARL Algorithms: The paper mentions that other MARL algorithms, such as Proximal Policy Optimization (PPO), were also evaluated but that DQN "consistently outperformed them." This indicates a broad baseline comparison, even though detailed PPO results are not presented.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that the integration of uncertainty-aware forecasting into the MARL DQN framework (DQN Forecasting) leads to significant improvements in P2P energy trading compared to traditional approaches.
6.1.1. Reward Convergence
Figure 3 illustrates the episode reward trajectories for the 10 agents over 1.8 million time steps.
This image is a chart showing reward convergence for multiple agents over 2 million time steps. Curves in different colors show each agent's cumulative reward against the time step; rewards gradually stabilize over time and eventually reach their respective maxima.
Figure 3. Reward Convergence over 2M time steps
- Faster Convergence: All agents show a rapid increase in episode reward during early training. A key finding is that the proposed uncertainty-aware forecasting DQN converges approximately 50% faster than the standard DQN (the standard DQN trajectory is not shown in this figure; the comparison is inferred from the text). It reaches high performance around 600,000 steps, requiring about 25% fewer time steps. This efficiency is attributed to the probabilistic forecasts, which allow the model to anticipate future states, narrowing the exploration space and guiding agents toward optimal policies more effectively.
- Stability: The reward curves show stable convergence, indicating that the agents successfully learn effective energy management strategies.
- Scalability: The modular and parallelizable environment, with a linear-complexity auction and communication-free agents, suggests the approach scales to larger, distributed systems, addressing a common concern in MARL applications to real-world microgrids.
6.1.2. Battery Management
Figure 4 presents the average daily battery State of Charge (SoC), load, and generation over a year.
This image is a chart showing trends in the daily average battery state of charge (SoC), load, and generation over time. The blue line shows the average battery percentage, the pink line the average load, and the orange line the average generation; all three series fluctuate markedly over the 24 hours of the day.
Figure 4. Daily Battery SOC, Load & Generation (Year Avg.)
- Anticipatory Behavior: The figure shows that the battery percentage (blue line) rises steadily from early morning and peaks in the late afternoon. This correlates with the average generation (orange line) pattern, indicating that agents coordinate charging during periods of high renewable generation (e.g., PV output during sunny hours).
- Strategic Discharge: The battery SoC then declines as stored energy is dispatched to meet evening loads (pink line), particularly before and during peak hours. This pre-charging and strategic discharging reduces reliance on the utility grid during peak tariff hours, contributing to cost-effectiveness and lowering the carbon emissions associated with peak hour generation.
- Improved Utilization: This anticipatory behavior, informed by uncertainty-aware forecasts, contrasts with the more reactive strategies of the standard DQN or rule-based methods and leads to more effective, community-beneficial storage utilization. (A sketch of how such a daily profile can be computed from hourly logs follows.)
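A year-averaged daily profile like Figure 4 can be produced with a simple hour-of-day aggregation. The sketch below assumes a hypothetical `agent_log.csv` with the column names shown; neither the file nor the names come from the paper.

```python
import pandas as pd

# Hypothetical hourly log; file and column names are assumptions for this sketch.
df = pd.read_csv("agent_log.csv", parse_dates=["timestamp"])

# Average each quantity by hour of day across the year, as in Figure 4.
profile = (df.assign(hour=df["timestamp"].dt.hour)
             .groupby("hour")[["soc_pct", "load_kw", "gen_kw"]]
             .mean())
print(profile)  # 24 rows: year-averaged daily SoC, load, and generation curves
```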
6.1.3. Key Performance Indicators (KPIs)
The following are the results from Table 1 of the original paper:
| Metric | Scenario | Rule Based | RB+QL | DQN | DQN Forecasting | % Diff (DQN vs DQN Forecasting) |
|---|---|---|---|---|---|---|
| Electricity Cost (Bought) (€) | w/o P2P | 125400 | 121300 | 105000 | 99100 | -5.7% |
| | with P2P | 119500 | 116800 | 102100 | 96800 | -3.2% |
| | P2P vs w/o P2P (%) | -4.7% | -3.7% | -2.8% | -2.9% | |
| Electricity Revenue (Sold) (€) | w/o P2P | 3600 | 3800 | 7850 | 8350 | +6.4% |
| | with P2P | 4400 | 4650 | 14450 | 20900 | +44.7% |
| | P2P vs w/o P2P (%) | +22.2% | +22.4% | +84.1% | +150.1% | |
| Peak Hour Demand (kW) | w/o P2P | 36000 | 34500 | 23200 | 14200 | -38.8% |
| | with P2P | 28500 | 26600 | 21850 | 11900 | -45.6% |
| | P2P vs w/o P2P (%) | -20.8% | -22.9% | -5.8% | -16.2% | |
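The "% Diff (DQN vs DQN Forecasting)" column is the relative change 100 × (DQN Forecasting − DQN) / DQN. The snippet below reproduces a few of the reported values from the table; small deviations come from rounding.

```python
# Spot-checking the "% Diff" column of Table 1.
def pct_diff(dqn: float, dqn_forecasting: float) -> float:
    return 100 * (dqn_forecasting - dqn) / dqn

print(round(pct_diff(7_850, 8_350), 1))    # 6.4   (revenue sold, w/o P2P)
print(round(pct_diff(14_450, 20_900), 1))  # 44.6  ~ reported +44.7 (revenue, with P2P)
print(round(pct_diff(23_200, 14_200), 1))  # -38.8 (peak hour demand, w/o P2P)
```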
Analysis of KPIs:
- Electricity Cost (Bought):
  - Without P2P: DQN Forecasting reduces costs by 5.7% (€105,000 to €99,100) compared to the standard DQN.
  - With P2P: DQN Forecasting reduces costs by 3.2% (€102,100 to €96,800) compared to the standard DQN.
  - Impact of P2P: P2P trading itself significantly reduces costs across all models (e.g., the standard DQN drops from €105,000 to €102,100). The combination of DQN Forecasting with P2P yields the lowest absolute cost (€96,800).
  - Overall: The uncertainty-aware approach consistently lowers energy purchase costs, showing the economic benefit of risk-informed decisions.
- Electricity Revenue (Sold):
  - Without P2P: DQN Forecasting increases revenue by 6.4% (€7,850 to €8,350) compared to the standard DQN.
  - With P2P: DQN Forecasting achieves a remarkable 44.7% increase (€14,450 to €20,900) compared to the standard DQN.
  - Impact of P2P: P2P trading has a massive positive impact on revenue (e.g., standard DQN revenue jumps by 84.1%, from €7,850 to €14,450). When P2P is enabled, the uncertainty-aware model lets agents exploit selling opportunities far more effectively, yielding the highest revenue (€20,900).
  - Overall: The most significant gain is in sales revenue, especially when P2P trading is active, indicating that uncertainty-aware agents are better at identifying and leveraging surplus energy for profit within the community.
- Peak Hour Demand (kW):
  - Without P2P: DQN Forecasting reduces peak hour grid demand by 38.8% (from 23,200 kW to 14,200 kW) compared to the standard DQN.
  - With P2P: DQN Forecasting achieves an even greater reduction of 45.6% (from 21,850 kW to 11,900 kW) compared to the standard DQN.
  - Impact of P2P: P2P trading also contributes to peak demand reduction (e.g., the standard DQN's peak falls by 5.8%, from 23,200 kW to 21,850 kW). The combination with uncertainty-aware forecasting results in the lowest peak demand (11,900 kW), demonstrating strong demand-side management and grid-stress mitigation.
  - Overall: This metric highlights the operational benefits for the grid and community, showing that uncertainty-aware models, especially with P2P, can significantly flatten the peak load profile.
6.1.4. Comparison with Baselines
- Rule-based and RB+QL methods show incremental improvements but lack the adaptability and predictive power of DQN-based approaches.
- The standard DQN provides a substantial improvement over rule-based methods, demonstrating the power of RL in this domain.
- However, DQN Forecasting (the proposed model) consistently outperforms the standard DQN across all KPIs, both with and without P2P trading. This validates the core hypothesis that uncertainty-aware forecasting enhances RL performance.
- The improvements are "even more pronounced when P2P trading is enabled," underscoring the synergy between advanced forecasting and market mechanisms.
6.2. Ablation Studies / Parameter Analysis
While the paper doesn't present formal ablation studies in a dedicated section, the comparison between "DQN" and "DQN Forecasting" effectively serves as an ablation on the uncertainty-aware forecasting component. The DQN baseline represents a DQN model that likely uses deterministic forecasts (or simpler forecasting methods), while DQN Forecasting integrates the KTU model's probabilistic forecasts. The consistent and significant performance gains of DQN Forecasting over DQN across all metrics (up to 44.7% revenue increase, 45.6% peak demand reduction) strongly validate the effectiveness of incorporating uncertainty-aware forecasting into the MARL framework.
The automated hyperparameter optimization via Optuna also indicates a systematic approach to finding optimal parameters for the KTU model, which contributes to its performance.
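As an illustration of what such an Optuna search might look like: the search space and the synthetic objective below are placeholders, since the paper does not list its actual ranges or validation metric.

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Hypothetical KTU search space; the paper's actual ranges are not given.
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    d_model = trial.suggest_categorical("d_model", [64, 128, 256])
    dropout = trial.suggest_float("dropout", 0.0, 0.3)
    # Stand-in for: train the KTU model and return its validation loss
    # (e.g., a probabilistic NLL). A synthetic score keeps the sketch runnable.
    return (lr - 1e-3) ** 2 + dropout * 0.1 + (d_model - 128) ** 2 * 1e-6

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```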
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully introduces a novel framework for P2P energy trading that synergistically combines uncertainty-aware knowledge transformer forecasting (KTU) with multi-agent deep reinforcement learning (MARL DQN). By explicitly quantifying and integrating prediction uncertainty into the MARL state space and reward functions, the proposed approach enables agents to make risk-sensitive and anticipatory decisions in stochastic energy environments.
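A minimal sketch of what "integrating prediction uncertainty into the reward" could look like: the penalty form and the risk_weight coefficient are assumptions for illustration, not the paper's actual reward terms.

```python
def risk_adjusted_reward(revenue: float, cost: float,
                         sigma_load: float, sigma_pv: float,
                         risk_weight: float = 0.1) -> float:
    """Hypothetical reward shaping: trading profit minus a penalty that grows
    with forecast uncertainty, so agents commit less energy when the KTU
    confidence intervals are wide. The paper's actual reward is more detailed;
    this only illustrates the risk-sensitivity idea."""
    profit = revenue - cost
    return profit - risk_weight * (sigma_load + sigma_pv)

# Identical profit, but the step with wider uncertainty scores lower.
print(risk_adjusted_reward(5.0, 3.0, sigma_load=0.2, sigma_pv=0.3))  # 1.95
print(risk_adjusted_reward(5.0, 3.0, sigma_load=1.0, sigma_pv=1.5))  # 1.75
```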
The key achievements include:
- Faster Convergence: Approximately 50% faster MARL training convergence.
- Efficient Battery Management: Smarter charging and discharging strategies based on forecasted generation and load and their uncertainty.
- Significant Economic Benefits: Energy purchase costs reduced by up to 5.7% (without P2P) and 3.2% (with P2P); electricity sales revenue increased by 44.7% (with P2P).
- Operational Resilience: Peak hour grid demand reduced by up to 45.6% (with P2P).

These results establish a new benchmark for resilient, efficient, and sustainable P2P energy trading systems, underscoring that uncertainty-aware learning is crucial for maximizing both economic and operational gains in decentralized energy communities.
7.2. Limitations & Future Work
The authors acknowledge several areas for future research:
- Integration of Additional Market Mechanisms: Exploring more complex or novel market designs within the P2P framework.
- Real-World Pilot Deployments: Moving beyond simulation to test the framework in actual microgrids or P2P communities, to validate its performance and address practical challenges.
- Optimization of the Forecasting Horizon: Investigating how different forecasting horizons (e.g., short-term vs. long-term) affect MARL decisions and overall system performance, and optimizing this parameter.
- Theoretical Analysis or Guarantees: Conducting more rigorous theoretical analysis of the convergence properties of the uncertainty-aware MARL system, and potentially providing performance guarantees.
7.3. Personal Insights & Critique
This paper presents a highly relevant and innovative contribution to the field of P2P energy trading. The explicit focus on uncertainty quantification and its direct integration into MARL decision-making is a significant step forward, addressing a critical weakness in much of the existing literature. The combination of Transformer architectures for forecasting with DQN for control, coupled with domain-specific regularization and reward functions, demonstrates a well-engineered and practical solution.
Inspirations and Applications:
- Robustness in Stochastic Systems: The methodology for incorporating uncertainty estimates into the state space and reward function could be applied to other domains characterized by high stochasticity and sequential decision-making, such as supply chain management, autonomous driving (e.g., predicting other agents' uncertain movements), or financial trading.
- Proactive Resource Management: The anticipatory battery management (pre-charging for peak hours) learned by agents highlights the power of probabilistic forecasts to enable proactive rather than reactive resource optimization. This could benefit any resource management problem where future supply or demand is uncertain but critical.
- Synergy of AI Components: The paper effectively demonstrates the synergy between advanced deep learning models (for forecasting) and reinforcement learning (for control), showing how combining specialized AI components can lead to superior overall system performance.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Complexity of Reward Function: While detailed, the reward functions are quite complex and hand-engineered; their optimality is assumed based on improved KPIs. In real-world scenarios, designing such intricate reward functions can be challenging and might require extensive domain expertise or iterative tuning. Inverse Reinforcement Learning could be an alternative for inferring rewards from desired expert behavior.
- Scalability of Double Auction: The paper states that the double auction has linear complexity and is scalable. However, for very large numbers of prosumers (e.g., thousands), a centralized auctioneer might still face computational or communication bottlenecks. Distributed P2P market-clearing mechanisms or blockchain-based solutions might need to be explored further to maintain true decentralization and scalability in future work.
- Generalizability of Uncertainty Model: The KTU model's effectiveness relies on the quality of its uncertainty quantification. While PICP, MPIW, and CRPS are good metrics, their behavior for very rare or extreme events (e.g., sudden equipment failures, extreme weather) might need further investigation (a sketch of computing these interval metrics appears at the end of this section). The Gaussian assumption for the load and PV distributions might not always hold for all energy phenomena.
- Real-world Implementation Challenges: Moving from simulation to real-world deployment (as suggested for future work) will introduce practical challenges such as data privacy concerns, communication latency, hardware limitations for DER control, and regulatory frameworks. The current model relies on a centralized auctioneer, which may raise privacy concerns despite sharing only aggregated data; fully decentralized market mechanisms could be more robust in practice.
- Interpretability of DQN Actions: While the results show improved performance, the exact policy learned by the DQN agents (e.g., why a specific buy or sell action was chosen at a precise moment given the uncertainty) may still lack full interpretability. For critical infrastructure such as energy grids, explainable AI is gaining importance.
- Computational Cost: Training Transformer-based models and MARL DQN systems can be computationally intensive. The paper reports faster convergence, but the absolute training time and resource requirements could still be substantial, especially when scaling to larger communities or more complex environments.

Overall, this paper makes a compelling case for the necessity and benefits of uncertainty-aware MARL in P2P energy trading, offering a robust framework that pushes the boundaries of current research.
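As referenced in the Generalizability item above, here is a minimal sketch of how PICP and MPIW can be computed for Gaussian forecasts; the data and variable names are synthetic, not from the paper.

```python
import numpy as np
from scipy.stats import norm

def picp_mpiw(y_true, mu, sigma, confidence=0.95):
    """Prediction Interval Coverage Probability (PICP) and Mean Prediction
    Interval Width (MPIW) for Gaussian forecasts, the reliability metrics
    cited above for the KTU model."""
    z = norm.ppf(0.5 + confidence / 2)  # e.g. 1.96 for a 95% interval
    lower, upper = mu - z * sigma, mu + z * sigma
    picp = np.mean((y_true >= lower) & (y_true <= upper))
    mpiw = np.mean(upper - lower)
    return picp, mpiw

# Toy check: well-calibrated forecasts should give PICP close to 0.95.
rng = np.random.default_rng(0)
mu = rng.uniform(0, 5, 1000)      # hypothetical load forecasts (kW)
sigma = rng.uniform(0.2, 0.8, 1000)
y = rng.normal(mu, sigma)         # "observed" loads drawn from the forecasts
print(picp_mpiw(y, mu, sigma))
```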