Paper status: completed

Uncertainty-Aware Knowledge Transformers for Peer-to-Peer Energy Trading with Multi-Agent Reinforcement Learning

Published: 07/23/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper presents a novel peer-to-peer energy trading framework combining uncertainty-aware prediction with multi-agent reinforcement learning. The KTU model quantifies forecast uncertainty, improving trading decisions, cost-efficiency, and revenue, and highlighting its potential for resilient, economically efficient energy communities.

Abstract

This paper presents a novel framework for Peer-to-Peer (P2P) energy trading that integrates uncertainty-aware prediction with multi-agent reinforcement learning (MARL), addressing a critical gap in current literature. In contrast to previous works relying on deterministic forecasts, the proposed approach employs a heteroscedastic probabilistic transformer-based prediction model called Knowledge Transformer with Uncertainty (KTU) to explicitly quantify prediction uncertainty, which is essential for robust decision-making in the stochastic environment of P2P energy trading. The KTU model leverages domain-specific features and is trained with a custom loss function that ensures reliable probabilistic forecasts and confidence intervals for each prediction. Integrating these uncertainty-aware forecasts into the MARL framework enables agents to optimize trading strategies with a clear understanding of risk and variability. Experimental results show that the uncertainty-aware Deep Q-Network (DQN) reduces energy purchase costs by up to 5.7% without P2P trading and 3.2% with P2P trading, while increasing electricity sales revenue by 6.4% and 44.7%, respectively. Additionally, peak hour grid demand is reduced by 38.8% without P2P and 45.6% with P2P. These improvements are even more pronounced when P2P trading is enabled, highlighting the synergy between advanced forecasting and market mechanisms for resilient, economically efficient energy communities.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Uncertainty-Aware Knowledge Transformers for Peer-to-Peer Energy Trading with Multi-Agent Reinforcement Learning

1.2. Authors

Mian Ibad Ali Shah, Enda Barrett, and Karl Mason. Affiliation: School of Computer Science, University of Galway, Ireland. ORCID for Mian Ibad Ali Shah: https://orcid.org/0009-0008-7288-9757

1.3. Journal/Conference

The paper is published as a preprint on arXiv; no peer-reviewed venue is stated. Given its subject matter, it appears intended for a journal or conference in energy systems, artificial intelligence, or reinforcement learning.

1.4. Publication Year

The publication date provided is 2025-07-22 (arXiv submission timestamp 2025-07-22T17:46:28.000Z). The paper is a preprint (v1) that has not yet undergone formal peer review and publication.

1.5. Abstract

This paper introduces a novel framework for Peer-to-Peer (P2P) energy trading that integrates uncertainty-aware prediction with multi-agent reinforcement learning (MARL). Unlike previous works that rely on deterministic forecasts, the proposed approach uses a heteroscedastic probabilistic transformer-based prediction model called Knowledge Transformer with Uncertainty (KTU). The KTU model quantifies prediction uncertainty for robust decision-making in stochastic P2P energy trading environments, using domain-specific features and a custom loss function to provide reliable probabilistic forecasts and confidence intervals. By integrating these uncertainty-aware forecasts into a MARL framework, agents can optimize trading strategies with a clear understanding of risk and variability. Experimental results demonstrate that an uncertainty-aware Deep Q-Network (DQN) significantly reduces energy purchase costs (up to 5.7% without P2P, 3.2% with P2P), increases electricity sales revenue (6.4% without P2P, 44.7% with P2P), and decreases peak hour grid demand (38.8% without P2P, 45.6% with P2P). These improvements are amplified with P2P trading, emphasizing the synergy between advanced forecasting and market mechanisms for efficient and resilient energy communities.

https://arxiv.org/abs/2507.16796 (Publication Status: Preprint on arXiv) PDF Link: https://arxiv.org/pdf/2507.16796v1.pdf

2. Executive Summary

2.1. Background & Motivation

The global energy landscape is rapidly shifting towards distributed energy resources (DERs), decarbonization, and decentralized energy markets, with Peer-to-Peer (P2P) energy trading emerging as a promising paradigm. P2P trading empowers prosumers (consumers who also produce energy, typically from renewable energy sources like solar PV) to directly exchange electricity, optimize local renewable energy utilization, and reduce carbon emissions.

However, a central challenge in operating P2P energy markets is the inherent uncertainty associated with renewable generation (e.g., solar, wind) and dynamic load profiles (consumer demand). This variability introduces significant risks that traditional deterministic forecasting methods fail to capture, leading to suboptimal trading and dispatch decisions. Current Reinforcement Learning (RL) and Multi-Agent Reinforcement Learning (MARL) approaches in P2P energy trading often overlook this crucial uncertainty, relying on point forecasts that do not reflect the full spectrum of possible future scenarios. This creates a critical gap in the literature: how to enable risk-informed decision-making in highly stochastic energy environments.

The problem is important because unmanaged uncertainty can undermine both the economic efficiency and system reliability of P2P energy communities, especially with high renewable energy penetration. Additionally, the growing integration of carbon markets into P2P trading requires robust systems that can align economic incentives with broader decarbonization goals.

The paper's entry point is to bridge this gap by integrating uncertainty-aware prediction directly into the MARL framework for P2P energy trading. It aims to provide prosumers with a clear understanding of future energy supply and demand variability, enabling them to make more robust and risk-sensitive trading decisions.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Novel Uncertainty-Aware Forecasting Model (KTU): It introduces the Knowledge Transformer with Uncertainty (KTU) model, a heteroscedastic probabilistic transformer-based prediction model specifically designed for P2P energy trading. This model explicitly quantifies prediction uncertainty for both load and PV generation, providing not just point estimates but also confidence intervals.
  • Integration of Uncertainty into MARL: The framework seamlessly integrates these uncertainty-aware forecasts into a Multi-Agent Deep Q-Network (DQN) system. This allows MARL agents to make risk-sensitive trading decisions by understanding the variability and risks associated with future energy availability.
  • Custom Loss Function and Physics-Informed Constraints: The KTU model is trained with a custom loss function that combines Gaussian negative log-likelihood with domain-specific regularization terms, ensuring reliable probabilistic forecasts. It also incorporates physics-informed constraints to ensure plausibility (e.g., no PV generation at night).
  • Enhanced Reward Structure for Economic and Environmental Objectives: The reward structure for the MARL agents incorporates forecast uncertainty, tariff periods, and battery constraints, effectively addressing both economic objectives (cost reduction, revenue generation) and environmental goals (peak tariff management, carbon emissions indirectly through peak demand reduction).
  • Automated Hyperparameter Optimization: The approach utilizes Optuna for automated hyperparameter optimization, ensuring optimal performance for both the forecasting and trading modules.
  • Significant Performance Improvements: The experimental results demonstrate substantial improvements over traditional DQN and other baselines:
    • Reduced Energy Purchase Costs: Up to 5.7% reduction without P2P trading and 3.2% with P2P trading.

    • Increased Electricity Sales Revenue: 6.4% increase without P2P and a remarkable 44.7% increase with P2P trading.

    • Reduced Peak Hour Grid Demand: 38.8% reduction without P2P and 45.6% with P2P trading.

    • Faster Convergence: The uncertainty-aware DQN achieves convergence approximately 50% faster, requiring about 25% fewer time steps than standard DQN.

    • Improved Battery Management: More effective and community-beneficial storage utilization through anticipatory charging during high renewable generation and discharging during peak demand.

      These findings collectively establish a new benchmark for resilient, efficient, and sustainable P2P energy systems, highlighting the critical role of uncertainty-aware learning for both economic and operational gains in decentralized energy contexts.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Peer-to-Peer (P2P) Energy Trading

P2P energy trading refers to a decentralized energy market model where individual energy producers (prosumers) and consumers can directly exchange electricity with each other, rather than solely relying on a central utility grid. This allows prosumers with distributed energy resources (DERs) like solar panels (PV) and battery storage to sell their surplus energy to neighbors or buy energy when needed, optimizing local energy consumption and generation. It promotes local renewable energy utilization, reduces reliance on the main grid, and can lead to cost savings and increased energy independence for communities.

3.1.2. Prosumer

A prosumer is an individual or entity that both produces and consumes energy. In the context of P2P energy trading, this typically refers to households or businesses equipped with DERs (e.g., rooftop solar panels, small wind turbines, battery storage) that can generate their own electricity, consume it, and also trade any surplus or deficit with other participants in a local energy community.

3.1.3. Distributed Energy Resources (DERs)

DERs are small-scale power generation or storage technologies that are located at or near the point of energy consumption. Examples include rooftop solar photovoltaic (PV) systems, wind turbines, battery energy storage systems (BESS), and electric vehicles (EVs) with vehicle-to-grid capabilities. DERs are key components of decentralized energy systems and enable P2P energy trading.

3.1.4. Reinforcement Learning (RL)

Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make optimal decisions by interacting with an environment. The agent performs actions in a state, receives rewards or penalties based on its actions, and transitions to a new state. The goal of the agent is to learn a policy (a mapping from states to actions) that maximizes the cumulative reward over time. RL is well-suited for sequential decision-making problems.

3.1.5. Multi-Agent Reinforcement Learning (MARL)

Multi-Agent Reinforcement Learning (MARL) extends RL to scenarios involving multiple agents interacting within the same environment. In MARL, each agent learns its own policy while considering the actions and policies of other agents. This can lead to complex dynamics, as agents might be cooperative, competitive, or a mix of both. MARL is particularly relevant for P2P energy trading, where multiple prosumers (each an agent) make independent decisions that affect the overall market.

3.1.6. Deep Q-Networks (DQN)

Deep Q-Networks (DQN) is an RL algorithm that combines Q-learning with deep neural networks. Traditional Q-learning uses a Q-table to store the maximum expected future rewards for taking a specific action in a given state. However, Q-tables become impractical for environments with large or continuous state-action spaces. DQN addresses this by using a deep neural network to approximate the Q-values, enabling RL to handle high-dimensional environments effectively. Key features of DQN include experience replay (storing and sampling past experiences) and a separate target network (a copy of the main network that is updated periodically) to stabilize training.

3.1.7. Transformers

Transformers are a neural network architecture introduced in 2017, initially for natural language processing, but later adapted for various tasks, including time series forecasting. Their key innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element, capturing long-range dependencies efficiently without relying on sequential processing (like Recurrent Neural Networks).

3.1.8. Probabilistic Forecasting

Unlike deterministic forecasting, which provides a single point estimate for a future value (e.g., "the load will be 10 kW"), probabilistic forecasting provides a range of possible future values along with their associated probabilities. This can be expressed as a probability distribution or as prediction intervals (e.g., "there is a 90% chance the load will be between 8 kW and 12 kW"). Probabilistic forecasts are crucial for risk-aware decision-making in uncertain environments like energy markets.

3.1.9. Heteroscedasticity

In statistics, heteroscedasticity refers to a situation where the variability (or variance) of the error term in a regression model is unequal across the range of predicted values. In probabilistic forecasting, a heteroscedastic model means that the predicted uncertainty (e.g., the width of the confidence interval) can vary depending on the input or predicted value. For example, a heteroscedastic energy forecasting model might predict a wider confidence interval for PV generation during cloudy periods (higher uncertainty) than during clear, sunny periods (lower uncertainty). This is more realistic than homoscedastic models which assume constant variance.

3.1.10. Confidence Intervals / Prediction Intervals

A confidence interval (or prediction interval in forecasting) provides a range of values within which the true value of a parameter or future observation is expected to lie with a certain probability. For example, a 95% prediction interval for energy load means that there is a 95% chance that the actual load will fall within that specified range. These intervals are vital for risk management, allowing decision-makers to understand the potential variability and make more robust plans.

3.1.11. Double Auction (DA)

A double auction is a market mechanism where both buyers and sellers submit bids and offers simultaneously. Buyers submit bids indicating the maximum price they are willing to pay for a certain quantity, and sellers submit offers (asks) indicating the minimum price they are willing to accept for a certain quantity. An auctioneer then matches buyers and sellers whose bids are higher than or equal to the offers, clearing the market at an agreed-upon price. This mechanism is common in financial markets and is adapted here for P2P energy trading.

3.2. Previous Works

The paper frames its contribution by contrasting it with several types of prior research:

  • Early P2P Market Mechanisms: Zhou et al. [41] showed that early community market mechanisms often applied uniform prices, which limited individualized incentives. Zheng et al. [38] introduced auction-based methods for trader-specific pricing, but these struggled with real-world uncertainties in trader behavior and energy supply.
  • RL in P2P Energy Trading: May et al. [19] demonstrated MARL as a promising solution for agents to learn optimal strategies in dynamic environments. Bhavana et al. [3] identified persistent technical challenges regarding scalability and uncertainty management. Bassey et al. [2] investigated AI applications in trading strategy optimization. However, a significant limitation highlighted is that most of these implementations rely on deterministic forecasts, failing to capture inherent variability. Zhang et al. [37] explicitly showed that forecasting errors significantly impact market efficiency, underscoring the need for uncertainty-aware models.
  • Transformer Architectures for Energy Forecasting: Liu et al. [17] showed promising results for Transformers in energy forecasting. However, these approaches primarily address single-agent settings or produce deterministic outputs without uncertainty quantification.
  • DQN-based Approaches: Chen et al. [5] developed a DQN-based approach for price prediction, but again, without uncertainty quantification.
  • Uncertainty-Aware but Lacking Integration: El et al. [9] investigated uncertainty-aware prosumer coalitional games but did not integrate probabilistic forecasting with multi-agent learning. Yazdani et al. [36] proposed robust optimization for real-time trading, and Uthayansuthi et al. [32] combined clustering, forecasting, and deep reinforcement learning. Yet, these approaches either lack advanced neural forecasting integration or focus primarily on economic optimization without fully considering the impact of uncertainty on MARL decision-making.

3.2.1. The Attention Mechanism (from "Attention Is All You Need" [33])

Since the paper uses a Transformer-based model, Attention is a foundational concept. The core idea behind Attention is to allow the model to weigh the importance of different parts of the input sequence when processing each element. This mechanism captures long-range dependencies without being constrained by sequential processing, which is a limitation of Recurrent Neural Networks (RNNs).

The Attention mechanism takes three inputs: a Query (Q), Keys (K), and Values (V). These are typically derived from the same input sequence in self-attention (as used in Transformers). The Query represents what we are looking for, the Keys represent what is available, and the Values are the actual information associated with the Keys.

The formula for Scaled Dot-Product Attention is:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Where:

  • $Q$: The query matrix, representing the current elements for which attention is computed. Its shape is typically $(\text{batch\_size}, \text{sequence\_length}, d_k)$.
  • $K$: The key matrix, representing all elements in the sequence that the query is compared against. Its shape is typically $(\text{batch\_size}, \text{sequence\_length}, d_k)$.
  • $V$: The value matrix, containing the information to be extracted from the sequence, weighted by the attention scores. Its shape is typically $(\text{batch\_size}, \text{sequence\_length}, d_v)$.
  • $d_k$: The dimension of the keys (and queries). The scaling factor $\sqrt{d_k}$ prevents the dot products from growing too large, which could push the softmax function into regions with very small gradients.
  • $K^T$: The transpose of the key matrix.
  • $\frac{QK^T}{\sqrt{d_k}}$: The attention scores, computed as the dot product of the query with all keys, scaled by $\sqrt{d_k}$. This measures the similarity between the query and each key.
  • $\mathrm{softmax}(\cdot)$: Normalizes the attention scores to sum to 1, effectively creating a probability distribution over the values. This distribution indicates how much attention each value should receive.
  • The values $V$ are then multiplied by these normalized attention scores, resulting in a weighted sum of the values. This weighted sum is the attention output, emphasizing the most relevant information from the input sequence.
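To make the computation concrete, here is a minimal NumPy sketch of scaled dot-product attention (an illustrative reconstruction, not code from the paper); the array shapes and toy inputs are assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for single (unbatched) sequences.

    Q: (seq_len_q, d_k), K: (seq_len_k, d_k), V: (seq_len_k, d_v).
    Returns the attended values and the attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V, weights

# Toy example: a 4-step sequence with d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)   # (4, 8) (4, 4); each row of attn sums to 1
```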

3.3. Technological Evolution

The field has evolved from:

  1. Centralized Grid Management: Traditional energy systems with large power plants and a central utility managing supply and demand.
  2. Emergence of DERs: The rise of rooftop solar, batteries, and EVs leading to prosumers.
  3. Basic P2P Concepts: Initial ideas for P2P trading, often with simple pricing or rule-based strategies.
  4. RL/MARL for P2P: Application of RL to optimize individual prosumer decisions, then MARL for multi-prosumer interactions. These initially relied on deterministic forecasts.
  5. Transformer for Forecasting: Adoption of advanced deep learning architectures like Transformers for more accurate time series forecasting.
  6. Uncertainty Quantification: Recognition of the limitations of deterministic forecasts and the need for probabilistic forecasting.
  7. Integration of Uncertainty-Aware MARL (This Paper): This paper represents a crucial step in this evolution by combining advanced probabilistic Transformer-based forecasting with MARL to enable risk-sensitive decision-making in P2P energy markets, addressing economic efficiency and environmental goals simultaneously.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's approach are:

  • Explicit Uncertainty Quantification: Unlike most prior RL/MARL P2P trading models that rely on deterministic forecasts, this work explicitly quantifies the uncertainty in load and PV generation using a heteroscedastic probabilistic model (KTU). This is a fundamental shift from point estimates to probability distributions.
  • Integration of Uncertainty into MARL State Space: The uncertainty estimates generated by KTU are directly incorporated into the state space of the DQN agents. This allows MARL agents to make risk-sensitive decisions rather than just optimizing based on mean forecasts.
  • Novel KTU Model Design: The KTU model itself is an innovation, building on Transformer architectures with dual output heads for mean and variance and a custom loss function that includes domain-specific regularization (e.g., temporal smoothness, penalty for nighttime PV generation).
  • Comprehensive Reward Function: The reward function is designed to account for forecast uncertainty (confidence score), grid tariff periods (peak, non-peak), and battery constraints, aligning agent incentives with both economic and system stability goals.
  • Synergy of Advanced Forecasting and Market Mechanisms: The paper demonstrates that the benefits of uncertainty-aware forecasting are amplified when combined with P2P trading mechanisms, leading to more significant improvements in cost, revenue, and peak demand reduction.
  • Automated Optimization: The use of Optuna for hyperparameter tuning ensures robust and optimal performance of the KTU model.

4. Methodology

4.1. Principles

The core principle of this paper's methodology is to enable multi-agent reinforcement learning (MARL) agents in a Peer-to-Peer (P2P) energy trading environment to make risk-sensitive and anticipatory decisions by explicitly understanding the uncertainty of future energy generation and load. Instead of relying on single-point (deterministic) forecasts, the system uses probabilistic forecasts that provide both a predicted mean and an associated variance, reflecting the inherent stochasticity of renewable energy and consumer demand. This uncertainty information is then fed directly into the MARL state space, allowing Deep Q-Network (DQN) agents to optimize their trading and battery management strategies with a clearer understanding of potential risks and opportunities, ultimately leading to more robust, economically efficient, and community-beneficial outcomes.

4.2. Core Methodology In-depth (Layer by Layer)

The proposed framework, illustrated in Figure 1, integrates a Knowledge Transformer with Uncertainty (KTU) model for probabilistic forecasting with a Multi-Agent Deep Q-Network (DQN) system for P2P energy trading.

4.2.1. Data and Feature Engineering

The foundation of accurate forecasting and MARL training is robust data and feature engineering.

  • P2P Community: The study considers a P2P energy trading community comprising 10 rural Finnish prosumers. This includes 4 dairy farms (data from Uski et al. [31]) and 6 households (synthetic loads based on Finnish profiles and seasonal multipliers [10, 29]), with 2 households owning Electric Vehicles (EVs).
  • PV Generation: Photovoltaic (PV) generation is simulated using SAM [20] (System Advisor Model). Renewable capacities for PV are set to 40% of the annual load, following [30].
  • Data Aggregation and Normalization: Multi-prosumer data (load, generation) are aggregated and then normalized to bring them to a common scale, which is standard practice for neural networks.
  • Categorical Encoding: Prosumer size is encoded categorically, allowing the model to differentiate between different types or scales of participants.
  • Temporal and Seasonal Features:
    • Cyclical encodings (sine/cosine transformations) are applied to features like hour-of-day, day-of-week, and month-of-year to capture their periodicity (e.g., [15]). For example, hour $H$ can be encoded as $\sin(2\pi H/24)$ and $\cos(2\pi H/24)$; a sketch of this encoding and the sliding-window sequence construction follows this list.
    • One-hot encodings are used for other discrete temporal features.
  • Custom Daylight Feature: A custom daylight feature is derived from Helsinki's astronomical data to capture high-latitude solar patterns more accurately. This feature likely indicates whether it's daytime or nighttime, and potentially the intensity of daylight.
  • Supervised Learning Sequences: For the Transformer-based forecasting model, data sequences are constructed using a sliding window approach, which is common practice for time series forecasting with Transformers [35]. This involves creating input-output pairs where a sequence of past observations predicts a future sequence.
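The following sketch shows, under assumed column names and an hourly timestamp index, how the cyclical encodings and sliding-window sequences described above could be constructed; it is not the authors' preprocessing code.

```python
import numpy as np
import pandas as pd

def add_cyclical_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sine/cosine encodings for hour-of-day, day-of-week, and month-of-year.

    Assumes `df` is indexed by a pandas DatetimeIndex.
    """
    df = df.copy()
    df["hour_sin"]  = np.sin(2 * np.pi * df.index.hour / 24)
    df["hour_cos"]  = np.cos(2 * np.pi * df.index.hour / 24)
    df["dow_sin"]   = np.sin(2 * np.pi * df.index.dayofweek / 7)
    df["dow_cos"]   = np.cos(2 * np.pi * df.index.dayofweek / 7)
    df["month_sin"] = np.sin(2 * np.pi * (df.index.month - 1) / 12)
    df["month_cos"] = np.cos(2 * np.pi * (df.index.month - 1) / 12)
    return df

def make_sequences(features: np.ndarray, targets: np.ndarray, window: int, horizon: int):
    """Sliding-window (input sequence, future target) pairs for the forecaster."""
    X, y = [], []
    for t in range(window, len(features) - horizon):
        X.append(features[t - window:t])       # past `window` steps of features
        y.append(targets[t + horizon - 1])     # target `horizon` steps ahead of the window
    return np.stack(X), np.stack(y)
```

With hourly data, for instance, `window=24` and `horizon=3` would use one day of history to predict three hours ahead, consistent with the three-hour-ahead forecast described later; the exact window length is an assumption here.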

4.2.2. Knowledge Transformer with Uncertainty (KTU) Model

The KTU model is a heteroscedastic probabilistic transformer designed for energy forecasting. Its architecture builds on recent Transformer developments [24, 33].

The overall KTU-DQN Ensemble Architecture is depicted in Figure 1.

Figure 1. KTU-DQN Ensemble Architecture for P2P Energy Trading. The figure is a schematic of the KTU-DQN ensemble architecture, showing the data sources, feature engineering, the Knowledge Transformer, probabilistic forecasting, the multi-agent DQN, and their interactions, emphasizing how the components work together for optimization and evaluation.

The KTU model's key components and processes are:

  • Input Projection Layer: This layer receives the engineered features. It includes learnable positional encodings [39], which are crucial for Transformers to understand the order of elements in a sequence, as the core self-attention mechanism is permutation-invariant. This layer projects the input features into a higher-dimensional space (e.g., 128-dimensional).

  • Transformer Encoder: The projected inputs are then passed through a Transformer encoder. This encoder utilizes multi-head self-attention to model complex temporal dependencies within the input sequences [40].

    • Multi-Head Self-Attention: This mechanism allows the model to jointly attend to information from different representation subspaces at different positions. The MultiHead mechanism is defined as:

      \mathbf{H} = \mathrm{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,\mathbf{W}^O

      Where:

      • $\mathbf{H}$: The output of the Multi-Head Attention mechanism.

      • $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$: The query, key, and value matrices, respectively, derived from the input sequence through linear transformations.

      • $h$: The number of attention heads.

      • $\mathrm{head}_i$: The output of the $i$-th attention head.

      • $\mathrm{Concat}(\cdot)$: A concatenation operation that joins the outputs of all attention heads along the feature dimension.

      • $\mathbf{W}^O$: An output projection matrix that linearly transforms the concatenated head outputs back to the desired dimension.

        Each attention head $\mathrm{head}_i$ is computed using the Scaled Dot-Product Attention formula:

      \mathrm{head}_i = \mathrm{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)

      Where:

      • $\mathbf{W}_i^Q$, $\mathbf{W}_i^K$, and $\mathbf{W}_i^V$: Learnable projection matrices for the $i$-th head, which transform the input $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ into different representation subspaces.
      • $\mathrm{Attention}(\cdot)$: The Scaled Dot-Product Attention function described in the "Prerequisite Knowledge" section.
  • Dual Output Heads for Probabilistic Predictions: A distinctive feature of KTU is its dual output heads. Instead of predicting a single point estimate, these heads predict both the mean ($\mu$) and variance ($\sigma^2$) for each target variable. The variance is predicted using a Softplus activation function to ensure it remains positive (as variance must be non-negative). This mechanism allows KTU to capture aleatoric uncertainty [22], the inherent randomness in the data itself.

    The model outputs probabilistic predictions for both prosumer load ($L$) and photovoltaic (PV) generation ($P$) as follows:

    p(L_{t+k} \mid \mathbf{x}_t) = \mathcal{N}(\mu_L(\mathbf{x}_t), \sigma_L^2(\mathbf{x}_t))
    p(P_{t+k} \mid \mathbf{x}_t) = \mathcal{N}(\mu_P(\mathbf{x}_t), \sigma_P^2(\mathbf{x}_t))

    Where:

    • $p(L_{t+k} \mid \mathbf{x}_t)$: The probability distribution of the load at forecast horizon $t+k$, given input features $\mathbf{x}_t$.
    • $p(P_{t+k} \mid \mathbf{x}_t)$: The probability distribution of the PV generation at forecast horizon $t+k$, given input features $\mathbf{x}_t$.
    • $\mathcal{N}(\cdot, \cdot)$: A Normal (Gaussian) distribution.
    • $\mu_L(\mathbf{x}_t)$ and $\sigma_L^2(\mathbf{x}_t)$: The predicted mean and variance for load, respectively, as functions of the input features $\mathbf{x}_t$.
    • $\mu_P(\mathbf{x}_t)$ and $\sigma_P^2(\mathbf{x}_t)$: The predicted mean and variance for PV generation, respectively, as functions of the input features $\mathbf{x}_t$.
    • $\mathbf{x}_t$: The input features available at time $t$.
    • $t+k$: The forecast horizon, meaning the prediction is for $k$ time steps into the future.
  • Physics-Informed Constraints: To ensure physical plausibility, especially for PV generation, the PV mean prediction is modulated by physics-informed constraints based on daylight and seasonality [14]:

    \mu_P^{\mathrm{final}}(\mathbf{x}_t) = \mathrm{softplus}(\mu_P(\mathbf{x}_t)) \cdot x_t^{\mathrm{daylight}} \cdot x_t^{\mathrm{norm\_daylight}}

    Where:

    • $\mu_P^{\mathrm{final}}(\mathbf{x}_t)$: The final, physics-informed mean prediction for PV generation.
    • $\mathrm{softplus}(\cdot)$: The softplus activation function, defined as $\mathrm{softplus}(x) = \ln(1 + e^x)$, which ensures that the PV generation is non-negative.
    • $\mu_P(\mathbf{x}_t)$: The raw mean prediction for PV from the neural network.
    • $x_t^{\mathrm{daylight}}$: A binary feature (0 or 1) indicating whether it is daylight at time $t$. This ensures no PV generation is predicted during nighttime.
    • $x_t^{\mathrm{norm\_daylight}}$: A normalized daylight feature (0 to 1) accounting for seasonality (e.g., longer daylight hours in summer, shorter in winter) [14].
  • Hyperparameter Optimization: The model's hyperparameters are optimized using Optuna [1], an automated hyperparameter optimization framework. This targets a three-hour-ahead joint forecast of prosumer load and PV generation. Key hyperparameters tuned include the learning rate, batch size, hidden dimensions, number of attention heads ($h$), dropout rate, and regularization weights.

  • Architectural Details: The KTU architecture consists of:

    • A two-layer feedforward network projecting inputs to a 128-dimensional space.
    • Layer normalization and ReLU activation [35].
    • Two Transformer encoder layers, each with four attention heads and 512-dimensional feedforward networks with 0.1 dropout.
    • Dual output heads with Softplus activation for variance prediction.
  • Composite Loss Function: The model is optimized using a composite loss function that combines the Gaussian negative log-likelihood (NLL) with domain-specific regularization terms:

    \mathcal{L} = \frac{1}{2} \sum_{i=1}^{N} \left[ \log(\sigma_i^2 + \epsilon) + \frac{(y_i - \mu_i)^2}{\sigma_i^2 + \epsilon} \right] + \alpha \sum_{t=1}^{T-1} |\mu_{t+1} - \mu_t| + \beta \sum_{t=1}^{T} P_t (1 - x_t^{\mathrm{daylight}})

    Where:

    • $\mathcal{L}$: The total composite loss.
    • $\frac{1}{2} \sum_{i=1}^{N} \left[ \log(\sigma_i^2 + \epsilon) + \frac{(y_i - \mu_i)^2}{\sigma_i^2 + \epsilon} \right]$: The Gaussian Negative Log-Likelihood (NLL) loss for heteroscedastic regression. It measures how well the predicted Normal distribution (with mean $\mu_i$ and variance $\sigma_i^2$) matches the true target $y_i$. Lower NLL indicates a better fit and better probabilistic calibration.
      • $N$: The number of training samples.
      • $y_i$: The ground-truth target value for sample $i$.
      • $\mu_i$: The predicted mean for sample $i$.
      • $\sigma_i^2$: The predicted variance for sample $i$.
      • $\epsilon$: A small constant (e.g., $10^{-6}$) added for numerical stability, preventing division by zero or taking the logarithm of zero if $\sigma_i^2$ becomes very small.
    • $\alpha \sum_{t=1}^{T-1} |\mu_{t+1} - \mu_t|$: A temporal smoothness regularization term. It penalizes large, abrupt changes in consecutive mean predictions, promoting smoother forecasts, which are often more realistic in energy systems.
      • $\alpha$: A hyperparameter controlling the strength of this regularization.
      • $T$: The sequence length of the forecast.
    • $\beta \sum_{t=1}^{T} P_t (1 - x_t^{\mathrm{daylight}})$: A physics-informed regularization term that penalizes physically impossible nighttime PV generation.
      • $\beta$: A hyperparameter controlling the strength of this penalty.
      • $P_t$: The predicted PV generation at time $t$.
      • $x_t^{\mathrm{daylight}}$: A binary feature (0 for night, 1 for day), so $1 - x_t^{\mathrm{daylight}}$ is 1 at night and 0 during the day. If $P_t > 0$ during nighttime ($x_t^{\mathrm{daylight}} = 0$), a penalty is incurred.
  • Probabilistic Forecasting Evaluation: For evaluation, samples are drawn from the predicted mean and variance distributions to construct empirical confidence intervals. Metrics like Prediction Interval Coverage Probability (PICP), Mean Prediction Interval Width (MPIW), and Continuous Ranked Probability Score (CRPS) are used to assess the quality and calibration of these probabilistic forecasts.
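As an illustration of how the dual output heads, the physics-informed PV constraint, and the composite loss fit together, here is a PyTorch sketch; the layer sizes, default weights alpha and beta, tensor shapes, and the use of means rather than sums in the loss are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHead(nn.Module):
    """Predicts a mean and a positive variance for one target (load or PV)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.mean_head = nn.Linear(d_model, 1)
        self.var_head = nn.Linear(d_model, 1)

    def forward(self, h):
        mu = self.mean_head(h).squeeze(-1)
        var = F.softplus(self.var_head(h)).squeeze(-1)   # variance must be positive
        return mu, var

def pv_physics_constraint(mu_pv, daylight, norm_daylight):
    """softplus(mu) * daylight flag * normalized daylight: no PV at night."""
    return F.softplus(mu_pv) * daylight * norm_daylight

def composite_loss(mu, var, y, pv_pred, daylight, alpha=0.1, beta=1.0, eps=1e-6):
    """Gaussian NLL + temporal smoothness + nighttime-PV penalty.

    Terms are averaged here for simplicity; the paper writes them as sums.
    alpha and beta defaults are placeholders (tuned via Optuna in the paper).
    """
    nll = 0.5 * (torch.log(var + eps) + (y - mu) ** 2 / (var + eps)).mean()
    smooth = (mu[..., 1:] - mu[..., :-1]).abs().mean()              # penalize abrupt jumps
    night_pv = (pv_pred * (1.0 - daylight)).clamp(min=0).mean()     # PV predicted at night
    return nll + alpha * smooth + beta * night_pv
```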

4.2.3. Deep Q-Networks (DQN)

The DQN framework is used for the Multi-Agent Reinforcement Learning part.

  • Q-function Approximation: Instead of a traditional Q-table, DQN uses a deep neural network (parameterized by $\theta$) to approximate the Q-function, which estimates the maximum expected future reward for taking a particular action in a given state.

  • TD Error Minimization: The neural network's parameters $\theta$ are updated via stochastic gradient descent to minimize the temporal-difference (TD) error, the difference between the current Q-value estimate and the target Q-value. The update rule as printed in the paper is truncated; reconstructed from the standard DQN update, it reads:

    \theta_{t+1} = \theta_t + \alpha \Big[ r_t + \gamma \max_{a'} q(s_{t+1}, a'; \theta_{\mathrm{TD}}) - q(s_t, a_t; \theta_t) \Big] \nabla_{\theta_t} q(s_t, a_t; \theta_t)

    Equivalently, the network minimizes the squared TD error $\big( r_t + \gamma \max_{a'} q(s_{t+1}, a'; \theta_{\mathrm{TD}}) - q(s_t, a_t; \theta) \big)^2$. Where (interpreting the symbols in the standard DQN sense):

    • $\theta_{t+1}$: The updated parameters of the DQN at time $t+1$.
    • $\theta_t$: The current parameters of the DQN at time $t$.
    • $\alpha$: The learning rate, controlling the step size of parameter updates.
    • $r_t$: The immediate reward received at time $t$.
    • $\gamma$: The discount factor (between 0 and 1), determining the importance of future rewards.
    • $\max_{a'} q(s_{t+1}, a'; \theta_{\mathrm{TD}})$: The maximum Q-value for the next state $s_{t+1}$ across all possible next actions $a'$, estimated by the target network with parameters $\theta_{\mathrm{TD}}$.
    • $q(s_t, a_t; \theta_t)$: The Q-value for the current state $s_t$ and action $a_t$, estimated by the main network with parameters $\theta_t$.
    • $\theta_{\mathrm{TD}}$: The parameters of the target network, which are periodically updated to match the main network's parameters. This stabilizes training by providing a more stable target for Q-value updates.
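For concreteness, here is a minimal PyTorch sketch of one DQN update against a frozen target network; the batch layout, Huber loss choice, and discount factor are assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def dqn_update(policy_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN step: minimize the TD error against a frozen target network.

    `batch` holds tensors (states, actions, rewards, next_states, dones);
    the field names and gamma are illustrative assumptions.
    """
    states, actions, rewards, next_states, dones = batch
    # Q(s_t, a_t; theta) for the actions actually taken
    q_sa = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # max_a' Q(s_{t+1}, a'; theta_TD) from the target network
        q_next = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * q_next
    loss = F.smooth_l1_loss(q_sa, td_target)   # Huber loss on the TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```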

4.2.4. Pricing Model and Double Auction

The market mechanism facilitates P2P energy trading:

  • Distributed P2P Energy Trading Model: The system operates with prosumers (agents) managing their own load, generation, and battery independently. They only share their electricity surplus or deficit with a centralized auctioneer. This preserves privacy as additional sensitive information is withheld.
  • Centralized Auctioneer: The auctioneer acts as both a market clearer and an advisor. It evaluates market conditions and determines the Internal Selling Price (ISP) and Internal Buying Price (IBP).
  • Supply and Demand Ratio (SDR): The ISP and IBP are determined using the Supply and Demand Ratio (SDR) method [16], allowing for real-time price setting based on current system demand and supply.
    • ISP: The price at which prosumers can sell their surplus energy within the community.
    • IBP: The price at which prosumers can purchase energy within the community.
  • Double Auction (DA) Mechanism: The paper employs a double auction mechanism, adapted from Qiu et al. [23].
    • Bids and Offers: Each buyer ($\beta$) submits a bid with a price $P_{\beta,b}$ and quantity $Q_{\beta,b}$. Each seller ($\sigma$) submits an offer with a price $P_{\sigma,s}$ and quantity $Q_{\sigma,s}$.
    • Order Books: The auctioneer maintains two order books: $O_b$ for buy orders and $O_s$ for sell orders, both sorted by price (typically buy orders from highest to lowest, sell orders from lowest to highest).
    • Market Clearing: The auctioneer clears the market by matching buyers and sellers based on their bid/offer prices and quantities, using the calculated ISP and IBP. The goal is to maximize local exchange and efficiency before resorting to transactions with the external grid.

4.2.5. Proposed Approach (Integration)

The overall framework combines the KTU probabilistic forecasts with MARL DQN agents within a P2P energy trading simulation. The process flow is shown in Figure 2.

Figure 2. Process flow of the Uncertainty-aware Forecasting DQN simulator. The figure shows the simulator's process flow, including agent training, the market auction within the P2P community, and market clearing, depicting how energy transfer and data transmission occur; each step and its key role are labeled.

  • Simulation Environment: The system simulates 10 prosumer agents over 2 million timesteps using the PettingZoo framework (a library for MARL environments).
  • Agent Autonomy: P2P participants are self-interested. The MARL setup trains agents independently to maximize their own utility, without explicit coordination. This allows each agent to adapt to the dynamic actions of others, modeling decentralized, competitive P2P energy trading.
  • State Space: Each agent's state space ($s_{i,t}$) includes:
    • Current load ($L_{i,t}$)
    • Current generation ($G_{i,t}$)
    • Current battery status ($B_{i,t}$)
    • Forecasted load ($FL_{i,t}$)
    • Forecasted generation ($FG_{i,t}$)
    • Crucially, uncertainty estimates from the probabilistic forecasting model ($U_{L,i,t}$, $U_{G,i,t}$).
  • Action Space: The action space consists of discrete actions representing energy management strategies:
    • Buying energy from the P2P market or grid.
    • Selling energy to the P2P market or grid.
    • Charging the battery.
    • Discharging the battery.
    • Self-consumption (using locally generated energy).
  • Reward Function: The reward function for each action is designed to incorporate forecast uncertainty (via a confidence score $\alpha_t^i$), tariff periods ($T_{\mathrm{grid}}$), and battery constraints. These are detailed in the "Reward Functions with Uncertainty-aware Forecasting" section below.
  • Market Clearing: At each timestep, agents submit bids and asks based on their needs and internal price signals. The double auction (DA) mechanism clears the market, matching trades to maximize local exchange before relying on external grid transactions.
  • Dynamic Pricing: Dynamic pricing is determined by the Supply and Demand Ratio (SDR) and grid tariffs, ensuring realistic market behavior.
  • Scalability: The framework is designed to be modular and parallelizable, featuring a linear-complexity auction and communication-free agents, allowing for straightforward scaling to larger P2P communities. MARL training can further support larger populations through distributed or federated learning.
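As a small illustration of how an individual agent's state vector could be assembled from its observations, KTU forecasts, and uncertainty estimates, here is a sketch; the field ordering, the action labels, and the confidence transform are assumptions.

```python
import numpy as np

def build_state(load, gen, soc, f_load, f_gen, u_load, u_gen):
    """State s_{i,t} = [L, G, B, FL, FG, U_L, U_G], as described in Algorithm 1."""
    return np.array([load, gen, soc, f_load, f_gen, u_load, u_gen], dtype=np.float32)

def confidence_score(u_load, u_gen):
    """Hypothetical mapping from forecast uncertainty to a confidence score in (0, 1];
    the paper does not give this formula explicitly."""
    return 1.0 / (1.0 + u_load + u_gen)

# Discrete energy-management actions available to each DQN agent (ordering is an assumption).
ACTIONS = ["buy", "sell", "charge", "discharge", "self_consume"]
```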

Algorithm 1: Uncertainty-Aware MARL for P2P Energy Trading

The step-by-step process is formalized in Algorithm 1:

Algorithm 1 Uncertainty-Aware MARL for P2P Energy Trading

1:  Initialize: agent set $N$, battery capacities $B_{cap}$, time $t = 0$
2:  while $t \leq T$ (simulation period) do
3:      if end of day then
4:          Reset hour counter and increment day
5:      end if
6:      for each agent $i \in N$ do
7:          Observe current $(L_{i,t}, G_{i,t}, B_{i,t})$
8:          Get forecasts $(FL_{i,t}, FG_{i,t})$ and uncertainties $(U_{L,i,t}, U_{G,i,t})$
9:          Form state vector $s_{i,t} = [L_{i,t}, G_{i,t}, B_{i,t}, FL_{i,t}, FG_{i,t}, U_{L,i,t}, U_{G,i,t}]$
10:         Select action $a_{i,t}$ using DQN policy $\pi(s_{i,t})$
11:         Calculate energy balance $E_{i,t} = G_{i,t} - L_{i,t}$
12:         if $E_{i,t} < 0$ or $a_{i,t}$ is buy then
13:             Add to BuyerBook$(i, |E_{i,t}|, p_{bid})$
14:         else if $E_{i,t} > 0$ or $a_{i,t}$ is sell then
15:             Add to SellerBook$(i, E_{i,t}, p_{ask})$
16:         end if
17:     end for
18:     Market Clearing:
19:     Calculate $\mathrm{SDR} = \sum \text{Supply} / \sum \text{Demand}$
20:     if $0 \leq \mathrm{SDR} \leq 1$ then
21:         Calculate $\overline{\mathrm{ISP}} = \frac{\lambda_{sell}\,\lambda_{buy}}{(\lambda_{buy} - \lambda_{sell})\,\mathrm{SDR} + \lambda_{sell}}$
22:         Calculate $\mathrm{IBP} = \overline{\mathrm{ISP}} \cdot \mathrm{SDR} + \lambda_{buy} \cdot (1 - \mathrm{SDR})$
23:     end if
24:     Match buyers and sellers by price priority using ISP and IBP
25:     Update rewards $R_{i,t}$ and advance time $t \gets t + 1$
26: end while

Explanation of Algorithm 1:

  1. Initialization: The simulation starts by defining the set of agents ($N$), their battery capacities ($B_{cap}$), and the initial time ($t = 0$).
  2. Simulation Loop: The process continues for a total simulation period $T$.
  3. End of Day Check: At the end of each day, a counter is reset, and the day is incremented.
  4. Agent-Specific Actions: For each agent $i$ in the set $N$:
    • Observation: The agent observes its current load ($L_{i,t}$), generation ($G_{i,t}$), and battery state of charge ($B_{i,t}$).
    • Forecasting and Uncertainty: The KTU model provides load forecasts ($FL_{i,t}$), generation forecasts ($FG_{i,t}$), and their associated uncertainties ($U_{L,i,t}$, $U_{G,i,t}$).
    • State Vector Formation: These observations, forecasts, and uncertainties are combined to form the state vector $s_{i,t}$. This is a critical step where uncertainty information is fed into the MARL process.
    • Action Selection: The DQN policy $\pi(s_{i,t})$ (learned by the DQN for agent $i$) selects an action $a_{i,t}$ based on the current state.
    • Energy Balance Calculation: The agent calculates its energy balance $E_{i,t} = G_{i,t} - L_{i,t}$.
    • Order Book Submission:
      • If the agent has an energy deficit ($E_{i,t} < 0$) or its chosen action is to buy, it adds an entry to the BuyerBook with its ID, the absolute quantity of energy needed, and its bid price ($p_{bid}$).
      • If the agent has an energy surplus ($E_{i,t} > 0$) or its chosen action is to sell, it adds an entry to the SellerBook with its ID, the quantity of energy available, and its ask price ($p_{ask}$).
  5. Market Clearing: After all agents have submitted their bids and offers:
    • SDR Calculation: The Supply and Demand Ratio (SDR) is calculated as the total supply from the SellerBook divided by the total demand from the BuyerBook.
    • ISP/IBP Calculation: If the SDR is between 0 and 1 (inclusive), the Internal Selling Price (ISP) and Internal Buying Price (IBP) are calculated dynamically using formulas that depend on the grid sell price ($\lambda_{sell}$) and grid buy price ($\lambda_{buy}$); a code sketch of these internal prices follows this list.
      • ISP: $\overline{\mathrm{ISP}} = \frac{\lambda_{sell}\,\lambda_{buy}}{(\lambda_{buy} - \lambda_{sell})\,\mathrm{SDR} + \lambda_{sell}}$, a dynamic internal price for selling energy within the community.
      • IBP: $\mathrm{IBP} = \overline{\mathrm{ISP}} \cdot \mathrm{SDR} + \lambda_{buy} \cdot (1 - \mathrm{SDR})$, a dynamic internal price for buying energy within the community.
    • Matching: Buyers and sellers are matched by price priority using the determined ISP and IBP to clear the market.
  6. Reward Update and Time Advance: Rewards ($R_{i,t}$) for each agent are updated based on the outcome of their actions and market clearing (using the functions detailed below), and the simulation time advances ($t \gets t + 1$).
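The SDR-based internal prices used in the market-clearing step can be sketched as follows; the fallback behaviour for SDR outside [0, 1] and the example tariff values are assumptions, not values from the paper.

```python
def internal_prices(total_supply: float, total_demand: float,
                    lambda_sell: float, lambda_buy: float):
    """SDR-based Internal Selling Price (ISP) and Internal Buying Price (IBP).

    lambda_sell / lambda_buy are the grid feed-in and retail prices;
    the 0 <= SDR <= 1 branch follows steps 19-22 of Algorithm 1.
    """
    if total_demand <= 0:              # no local demand: fall back to grid prices (assumption)
        return lambda_sell, lambda_buy
    sdr = total_supply / total_demand
    if 0 <= sdr <= 1:
        isp = (lambda_sell * lambda_buy) / ((lambda_buy - lambda_sell) * sdr + lambda_sell)
        ibp = isp * sdr + lambda_buy * (1 - sdr)
    else:                              # surplus exceeds demand: internal prices at feed-in level (assumption)
        isp = ibp = lambda_sell
    return isp, ibp

# Example with placeholder tariffs (EUR/kWh): feed-in 0.05, retail 0.20
print(internal_prices(total_supply=30.0, total_demand=50.0,
                      lambda_sell=0.05, lambda_buy=0.20))
# -> ISP ~ 0.071, IBP ~ 0.123, both between the feed-in and retail prices
```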

Reward Functions with Uncertainty-aware Forecasting

The reward functions are designed to guide the DQN agents towards optimal energy management strategies, incorporating forecast uncertainty and grid tariff periods.

Notation:

  • $G_t^i$: Generation for agent $i$ at time $t$.

  • $L_t^i$: Load for agent $i$ at time $t$.

  • $\mathrm{SoC}_t^i$: State of Charge (in %) for agent $i$'s battery at time $t$.

  • $T_{\mathrm{grid}}$: Grid tariff period, which can be Normal (N), Non-Peak (NP), Peak (P), or Default (D).

  • $\tilde{G}_{t+k}^i$: Future forecasted generation for agent $i$ at horizon $t+k$.

  • $\hat{L}_{t+k}^i$: Future forecasted load for agent $i$ at horizon $t+k$.

  • $\Delta_{\mathrm{peak}}^i$: Peak deficit prediction for agent $i$, indicating whether the agent is expected to have an energy deficit during peak hours.

  • $\alpha_t^i$: Confidence score for agent $i$ at time $t$, derived from the uncertainty estimates of the KTU model. A higher confidence score corresponds to lower predicted uncertainty.

    Here are the specific reward functions for different actions:

1. Charge and Buy This action implies the agent is charging its battery and buying energy from the grid or P2P market to do so.

R = \left\{ \begin{array}{ll} 0.5 + 1.5\alpha_t^i + 1.0, & \text{if } \phi_1 \\ 0.5 + \alpha_t^i, & \text{if } \phi_2 \\ 0.5, & \text{if } \phi_3 \\ 0, & \text{if } T_{\mathrm{grid}} = P \end{array} \right.

Where:

  • $\phi_1: \mathrm{SoC}_t^i \le 90\% \land T_{\mathrm{grid}} = NP \land \Delta_{\mathrm{peak}}^i > 0$: The reward is highest if the battery is not full, it is a non-peak tariff period, and a future peak deficit is predicted. This encourages charging during off-peak times to prepare for peak demand, with an extra bonus for higher confidence.
  • $\phi_2: \mathrm{SoC}_t^i \le 90\% \land T_{\mathrm{grid}} = N$: A moderate reward if the battery is not full and it is a normal tariff period.
  • $\phi_3: \mathrm{SoC}_t^i \le 90\% \land G_t^i < L_t^i$: A baseline reward if the battery is not full and generation is less than load (i.e., there is a need for energy).
  • $T_{\mathrm{grid}} = P$: No reward (0) if the grid tariff is peak, discouraging buying during expensive periods. A code sketch of this piecewise reward follows.
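As an example of how one of these piecewise rewards could be coded, here is a sketch of the "Charge and Buy" case; treating the peak-tariff case as taking precedence over the other conditions is an interpretation, and the thresholds follow the conditions above.

```python
def charge_and_buy_reward(soc: float, gen: float, load: float,
                          tariff: str, peak_deficit: float, confidence: float) -> float:
    """Piecewise 'Charge and Buy' reward following the cases above.

    `tariff` is one of {"P", "NP", "N", "D"}; `confidence` is the alpha_t^i score;
    `soc` is the battery state of charge in percent.
    """
    if tariff == "P":                                          # never reward buying at peak
        return 0.0
    if soc <= 90 and tariff == "NP" and peak_deficit > 0:      # phi_1
        return 0.5 + 1.5 * confidence + 1.0
    if soc <= 90 and tariff == "N":                            # phi_2
        return 0.5 + confidence
    if soc <= 90 and gen < load:                               # phi_3
        return 0.5
    return 0.0
```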

2. Buy This action implies the agent is simply buying energy from the grid or P2P market for immediate consumption, without charging the battery.

R = \left\{ \begin{array}{ll} 0.25, & \text{if } \phi_1 \wedge T_{\mathrm{grid}} = P \\ 0.5, & \text{if } \phi_1 \\ 0, & \text{otherwise} \end{array} \right.

Where:

  • $\phi_1: G_t^i < L_t^i \land \mathrm{SoC}_t^i < 10\%$: The primary condition is that generation is less than load (the agent needs energy) AND the battery SoC is very low (less than 10%).
  • The reward is lower (0.25) if this condition is met during a peak tariff period, and moderate (0.5) otherwise. No reward if the condition $\phi_1$ is not met.

3. Sell This action implies the agent is selling its surplus energy (not from battery) to the grid or P2P market.

R = \left\{ \begin{array}{ll} 0.75, & \text{if } \phi_1 \wedge T_{\mathrm{grid}} = P \\ 0.5, & \text{if } \phi_1 \\ 0, & \text{otherwise} \end{array} \right.

Where:

  • $\phi_1: G_t^i > L_t^i \land \mathrm{SoC}_t^i \geq 90\%$: The condition is that generation is greater than load (the agent has a surplus) AND the battery SoC is high (at or above 90%).
  • The reward is higher (0.75) if this condition is met during a peak tariff period (encouraging selling when prices are high), and moderate (0.5) otherwise.

4. Discharge and Sell This action implies the agent is discharging its battery and selling the energy to the grid or P2P market.

R = \left\{ \begin{array}{ll} (0.5 + 0.5\alpha_t^i) \cdot 1.5, & \text{if } \phi_1 \\ 0.5, & \text{if } \phi_2 \\ 0, & \text{otherwise} \end{array} \right.

Where:

  • $\phi_1: G_t^i > L_t^i \land \mathrm{SoC}_t^i \ge 20\% \land T_{\mathrm{grid}} = P$: The reward is highest if generation exceeds load (surplus), battery SoC is at least 20%, AND it is a peak tariff period. This strongly incentivizes selling stored energy during peak times; the confidence score $\alpha_t^i$ further boosts the reward.
  • $\phi_2: G_t^i > L_t^i \land \mathrm{SoC}_t^i \ge 90\%$: A moderate reward if generation exceeds load and battery SoC is very high (at or above 90%), even outside peak time.

5. Discharge and Buy This action implies the agent is discharging its battery, but also buying energy, perhaps to meet immediate load or for other operational reasons. This combination seems unusual without more context but could represent a complex scenario where internal load exceeds generation even after discharging, or for specific grid stability services.

R = \left\{ \begin{array}{ll} (0.5 + 0.5\alpha_t^i) \cdot 1.5, & \text{if } \phi_1 \wedge T_{\mathrm{grid}} = P \\ 0.5, & \text{if } \phi_1 \\ 0, & \text{otherwise} \end{array} \right.

Where:

  • $\phi_1: G_t^i < L_t^i \land \mathrm{SoC}_t^i \geq 10\%$: The condition is that generation is less than load (the agent needs energy) AND battery SoC is at least 10%.
  • The reward is boosted by the confidence score and an additional factor if it is a peak tariff period, with a moderate reward (0.5) otherwise. This action likely corresponds to meeting high demand from the battery during peak periods while still purchasing a small amount, or to avoiding a very low SoC while optimizing for peak conditions.

6. Self-Consumption This action means the agent is directly consuming its locally generated energy, minimizing reliance on the grid or P2P market.

R = \left\{ \begin{array}{ll} 1.2, & \text{if } \phi_1 \wedge T_{\mathrm{grid}} = P \\ 1.0, & \text{if } \phi_1 \\ 0.5, & \text{if } \phi_2 \\ 0, & \text{otherwise} \end{array} \right.

Where:

  • $\phi_1: |G_t^i - L_t^i| \leq 0.1$: The highest reward if generation closely matches load (within 0.1 units), meaning high self-sufficiency. This is further boosted if it occurs during a peak tariff period (avoiding expensive grid purchases).
  • $\phi_2: 0.1 < |G_t^i - L_t^i| \leq 0.2$: A moderate reward if the generation-load mismatch is slightly larger (between 0.1 and 0.2 units). This encourages balancing generation and load locally.

7. Self and Charge This action implies the agent is consuming its local generation, and any surplus is used to charge its battery.

R={0.5+2.0αti+1.0,if ϕ10.5+0.5αti,if ϕ20,if Tgrid=P R = \left\{ \begin{array} { l l } { 0 . 5 + 2 . 0 \alpha _ { t } ^ { i } + 1 . 0 , } & { \mathrm { i f ~ } \phi _ { 1 } } \\ { 0 . 5 + 0 . 5 \alpha _ { t } ^ { i } , } & { \mathrm { i f ~ } \phi _ { 2 } } \\ { 0 , } & { \mathrm { i f ~ } T _ { \mathrm { g r i d } } = P } \end{array} \right.

Where:

  • $\phi_1: G_t^i > L_t^i \land \mathrm{SoC}_t^i \le 90\% \land T_{\mathrm{grid}} = NP$: The highest reward if generation exceeds load (surplus), the battery is not full, and it is a non-peak tariff period. This strongly incentivizes using local surplus to charge the battery during off-peak times, with the confidence score further boosting the reward.
  • $\phi_2: G_t^i > L_t^i \land \mathrm{SoC}_t^i \le 90\%$: A moderate reward if generation exceeds load and the battery is not full, regardless of tariff period.
  • $T_{\mathrm{grid}} = P$: No reward (0) during peak tariff periods, discouraging charging when it might be more beneficial to sell or to meet peak demand.

8. Self and Discharge This action implies the agent is consuming its local generation and also discharging its battery to meet its load.

$$
R = \begin{cases}
(0.5 + 0.5\,\alpha_t^i) \times 1.5, & \text{if } \phi_1 \wedge T_{\mathrm{grid}} = P \\
0.5, & \text{if } \phi_1 \\
0, & \text{otherwise}
\end{cases}
$$

Where:

  • $\phi_1: G_t^i < L_t^i \land \mathrm{SoC}_t^i \ge 20\%$: The condition holds when generation is less than load (the agent needs energy) and battery SoC is at least 20%.
  • The reward is boosted by the confidence score and scaled further during peak tariff periods, strongly incentivizing the use of stored energy to meet local demand at peak times; otherwise a moderate reward (0.5) applies. A brief code sketch of these piecewise rewards follows.
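To make the piecewise structure concrete, below is a minimal illustrative sketch (not the authors' code) of two of the reward functions above, the Self-Consumption and Self and Discharge cases. Thresholds and coefficients follow the equations in the text; the function and variable names (`gen`, `load`, `soc`, `alpha`, `peak`) are hypothetical.

```python
def reward_self_consumption(gen: float, load: float, peak: bool) -> float:
    """Action 6: reward closeness of generation to load, boosted during peak tariff."""
    mismatch = abs(gen - load)
    if mismatch <= 0.1:          # phi_1: high self-sufficiency
        return 1.2 if peak else 1.0
    if mismatch <= 0.2:          # phi_2: moderate self-sufficiency
        return 0.5
    return 0.0


def reward_self_and_discharge(gen: float, load: float, soc: float,
                              alpha: float, peak: bool) -> float:
    """Action 8: discharge to cover a local deficit; alpha is the forecast confidence."""
    phi_1 = (gen < load) and (soc >= 0.20)   # deficit and at least 20% SoC
    if phi_1 and peak:
        return (0.5 + 0.5 * alpha) * 1.5     # strongest incentive during peak hours
    if phi_1:
        return 0.5
    return 0.0


# Example: a small deficit at 30% SoC during peak hours with confidence 0.8 -> 1.35
print(reward_self_and_discharge(gen=1.0, load=2.0, soc=0.30, alpha=0.8, peak=True))
```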

5. Experimental Setup

5.1. Datasets

The P2P energy trading community is simulated based on real-world characteristics and synthetic data generation:

  • Prosumer Community: Comprises 10 rural Finnish prosumers.
    • Dairy Farms: 4 dairy farms, using data from Uski et al. [31]. This likely includes specific load profiles characteristic of agricultural operations.
    • Households: 6 households, with synthetic loads generated based on Finnish load profiles and seasonal multipliers [10, 29]. This ensures realistic daily and seasonal variations in household energy consumption (a small illustrative sketch of this kind of synthetic load generation appears after this list).
    • Electric Vehicles (EVs): 2 of the 6 households own EVs, introducing dynamic and flexible load components due to charging patterns.
  • PV Generation Data: Photovoltaic (PV) generation is simulated using the System Advisor Model (SAM) [20], a performance and financial model for renewable energy projects developed by the National Renewable Energy Laboratory (NREL). This provides realistic PV output data considering local solar irradiance and weather patterns.
  • Renewable Capacity: Consistent with [30], the renewable capacities (primarily PV) are set to 40% of the annual load for each prosumer, providing a significant but not overwhelming amount of local generation.
  • Domain: The dataset specifically represents a rural Finnish energy community, implying characteristics like high-latitude solar patterns and potentially colder climates.
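As an illustration of how such synthetic household loads could be produced, here is a small sketch assuming a made-up 24-hour base profile and made-up seasonal multipliers; the actual Finnish profiles and multipliers come from the cited references and are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Illustrative 24-hour base profile in kW (morning and evening peaks); placeholder values.
base_profile = np.array([0.3] * 6 + [0.8, 1.2, 0.9] + [0.5] * 7 +
                        [1.0, 1.5, 1.6, 1.2] + [0.6] * 4)

# Illustrative seasonal multipliers (higher consumption in a cold winter); placeholder values.
seasonal_multiplier = {"winter": 1.4, "spring": 1.0, "summer": 0.8, "autumn": 1.1}


def daily_load(season: str, noise_sd: float = 0.05) -> np.ndarray:
    """Return a synthetic 24-value hourly load profile for one household (kW)."""
    load = base_profile * seasonal_multiplier[season]
    return np.clip(load + rng.normal(0.0, noise_sd, size=24), 0.0, None)


print(daily_load("winter").round(2))
```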

5.2. Evaluation Metrics

The paper evaluates the performance of the proposed framework using several key performance indicators (KPIs) and specific probabilistic forecasting metrics.

5.2.1. Key Performance Indicators (KPIs) for Trading

The primary KPIs used to assess the economic and operational benefits of the P2P trading system are:

  1. Electricity Purchase Cost:

    • Conceptual Definition: This metric quantifies the total expenditure incurred by all prosumers in the community for purchasing electricity from either the external grid or other P2P participants. Lower costs indicate better economic efficiency and optimization of energy resources.
    • Mathematical Formula: Not explicitly provided in the paper, but conceptually, it would be the sum of (quantity bought from grid * grid purchase price) + (quantity bought from P2P * P2P purchase price) over the simulation period for all agents.
    • Symbol Explanation: Quantity Bought refers to the amount of electricity purchased. Grid Purchase Price is the tariff for buying from the main grid. P2P Purchase Price is the price paid within the P2P market (e.g., IBP).
  2. Electricity Sales Revenue:

    • Conceptual Definition: This metric quantifies the total income generated by all prosumers from selling their surplus electricity to either the external grid or other P2P participants. Higher revenue indicates more effective utilization of local generation and better market participation.
    • Mathematical Formula: Not explicitly provided, but conceptually, it would be the sum of (quantity sold to grid * grid sale price) + (quantity sold to P2P * P2P sale price) over the simulation period for all agents.
    • Symbol Explanation: Quantity Sold refers to the amount of electricity sold. Grid Sale Price is the tariff for selling to the main grid. P2P Sale Price is the price received within the P2P market (e.g., ISP).
  3. Peak Hour Grid Demand Reduction:

    • Conceptual Definition: This metric measures the reduction in the maximum amount of electricity imported from the main grid during peak tariff hours. Reducing peak demand is crucial for grid stability, avoiding costly infrastructure upgrades, and often for reducing reliance on carbon-intensive peaker plants.
    • Mathematical Formula: Not explicitly provided, but conceptually, it would be a comparison between the maximum hourly grid import observed in a baseline scenario versus the proposed model during specified peak hours (a minimal computation sketch covering all three KPIs follows this list).
    • Symbol Explanation: Peak Hour Demand refers to the highest aggregate electricity consumption from the grid specifically during designated peak periods.
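A minimal sketch of how these three KPIs could be computed from simulated trade logs is given below. The trade-record layout, field names, and peak-hour window are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Trade:
    hour: int        # hour of day, 0-23
    qty_kwh: float   # traded energy amount (kWh)
    price: float     # EUR per kWh
    kind: str        # "buy_grid", "buy_p2p", "sell_grid", or "sell_p2p"

PEAK_HOURS = set(range(17, 21))  # illustrative peak-tariff window


def purchase_cost(trades):
    """Total expenditure on energy bought from the grid or the P2P market."""
    return sum(t.qty_kwh * t.price for t in trades if t.kind.startswith("buy"))


def sales_revenue(trades):
    """Total income from energy sold to the grid or the P2P market."""
    return sum(t.qty_kwh * t.price for t in trades if t.kind.startswith("sell"))


def peak_grid_demand(trades):
    """Maximum hourly energy imported from the main grid during peak hours."""
    hourly = {}
    for t in trades:
        if t.kind == "buy_grid" and t.hour in PEAK_HOURS:
            hourly[t.hour] = hourly.get(t.hour, 0.0) + t.qty_kwh
    return max(hourly.values(), default=0.0)
```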

5.2.2. Probabilistic Forecasting Metrics (for KTU model)

The quality and calibration of the probabilistic forecasts generated by the KTU model are assessed using standard metrics (a short computation sketch for all three follows the list):

  1. Prediction Interval Coverage Probability (PICP):

    • Conceptual Definition: PICP measures the percentage of actual observations that fall within the predicted prediction interval (e.g., a 90% confidence interval). For a well-calibrated probabilistic forecast at a 90% confidence level, the PICP should be close to 90%. If it is markedly lower, the intervals are too narrow and the model is overconfident; if markedly higher, the intervals are wider than necessary and the model is overly conservative.
    • Mathematical Formula: $ \mathrm{PICP} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(y_i \in [\mathrm{LBI}_i, \mathrm{UBI}_i]) \times 100\% $
    • Symbol Explanation:
      • $N$: Total number of observations.
      • $\mathbf{1}(\cdot)$: Indicator function, equal to 1 if the condition inside the parentheses is true and 0 otherwise.
      • $y_i$: The actual observed value for the $i$-th instance.
      • $[\mathrm{LBI}_i, \mathrm{UBI}_i]$: The lower and upper bounds of the prediction interval for the $i$-th instance.
  2. Mean Prediction Interval Width (MPIW):

    • Conceptual Definition: MPIW measures the average width of the prediction intervals across all forecasts. For a given PICP, a narrower MPIW is generally preferred, as it indicates more precise forecasts. However, MPIW must be considered in conjunction with PICP to avoid overly narrow intervals that don't cover enough observations.
    • Mathematical Formula: $ \mathrm{MPIW} = \frac{1}{N} \sum_{i=1}^{N} (\mathrm{UBI}_i - \mathrm{LBI}_i) $
    • Symbol Explanation:
      • NN: Total number of observations.
      • UBIi\mathrm{UBI}_i: The Upper Bound of the prediction interval for the ii-th instance.
      • LBIi\mathrm{LBI}_i: The Lower Bound of the prediction interval for the ii-th instance.
  3. Continuous Ranked Probability Score (CRPS):

    • Conceptual Definition: CRPS is a proper scoring rule that evaluates the quality of probabilistic forecasts. It measures the "distance" between the predicted cumulative distribution function (CDF) and the empirical CDF of the observation. Unlike PICP and MPIW, which evaluate the interval, CRPS evaluates the entire probabilistic forecast distribution. A lower CRPS indicates a better, more accurate, and well-calibrated probabilistic forecast.
    • Mathematical Formula: For a probabilistic forecast with CDF $F$ and an observed value $y$, the CRPS is defined as: $ \mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} (F(x) - \mathbf{1}(x \ge y))^2 \, dx $. Equivalently, $ \mathrm{CRPS}(F, y) = E[|X - y|] - \frac{1}{2} E[|X_1 - X_2|] $, where $X, X_1, X_2$ are independent draws from $F$. For a Gaussian forecast $F \sim \mathcal{N}(\mu, \sigma^2)$, the CRPS has the closed form $ \mathrm{CRPS}(\mu, \sigma, y) = \sigma \left( z\,(2\Phi(z) - 1) + 2\varphi(z) - \frac{1}{\sqrt{\pi}} \right) $, where $z = \frac{y-\mu}{\sigma}$.
    • Symbol Explanation:
      • $F(x)$: The cumulative distribution function (CDF) of the probabilistic forecast.
      • $y$: The actual observed value.
      • $\mathbf{1}(x \ge y)$: Indicator function, equal to 1 if $x \ge y$ and 0 otherwise; it forms the empirical CDF of the observed value.
      • $\mu$: The mean of the predicted Normal distribution.
      • $\sigma$: The standard deviation of the predicted Normal distribution.
      • $\Phi(z)$ and $\varphi(z)$: The CDF and probability density function of the standard Normal distribution.
      • $z$: The standardized score (z-score) of the observation.
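The three metrics can be computed directly from the predicted means and standard deviations. Below is a short sketch using NumPy and SciPy, assuming Gaussian predictive distributions and a 90% central interval (z ≈ 1.645); these choices, and all variable names, are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm


def picp(y, lower, upper):
    """Prediction Interval Coverage Probability, in percent."""
    return 100.0 * np.mean((y >= lower) & (y <= upper))


def mpiw(lower, upper):
    """Mean Prediction Interval Width."""
    return np.mean(upper - lower)


def crps_gaussian(y, mu, sigma):
    """Average closed-form CRPS for Gaussian forecasts N(mu, sigma^2)."""
    z = (y - mu) / sigma
    return np.mean(sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi)))


# Toy example with a 90% central interval (z ~= 1.645)
y = np.array([1.0, 2.0, 3.0])
mu = np.array([1.1, 1.8, 3.2])
sigma = np.array([0.3, 0.4, 0.5])
lower, upper = mu - 1.645 * sigma, mu + 1.645 * sigma
print(picp(y, lower, upper), mpiw(lower, upper), crps_gaussian(y, mu, sigma))
```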

5.3. Baselines

The proposed uncertainty-aware DQN model is compared against several baselines to demonstrate its superior performance:

  • Rule Based: A simple rule-based approach, likely employing predefined heuristics for energy management (e.g., charge when PV is high, discharge when load is high). These rules are described in previous work [27, 25].
  • RB+QL (Rule Based + Q-Learning Ensemble): An ensemble approach combining rule-based strategies with Q-learning. This suggests that Q-learning might be used to refine decisions within a rule-based framework or learn when to activate certain rules. These algorithms are described in [27, 25].
  • DQN (Standard Deep Q-Network): A baseline DQN model that does not incorporate uncertainty-aware forecasts. It relies on deterministic (point) forecasts for load and generation. This comparison highlights the specific benefit of integrating uncertainty information.
  • DQN Forecasting (This paper's proposed model): The DQN model that integrates the uncertainty-aware forecasts from the KTU model. The paper refers to this as "DQN Forecasting" to distinguish it from the standard DQN.
  • Other MARL Algorithms: The paper mentions that other MARL algorithms, such as Proximal Policy Optimization (PPO), were also evaluated but DQN "consistently outperformed them." This indicates a comprehensive baseline comparison, even if detailed results for PPO are not presented.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that the integration of uncertainty-aware forecasting into the MARL DQN framework (DQN Forecasting) leads to significant improvements in P2P energy trading compared to traditional approaches.

6.1.1. Reward Convergence

Figure 3 illustrates the episode reward trajectories for the 10 agents over 1.8 million time steps.

Figure 3. Reward convergence over 2M time steps: cumulative episode reward per agent versus training steps, with each curve stabilizing as its agent's policy converges.

  • Faster Convergence: All agents show a rapid increase in episode reward during early training. A key finding is that the proposed uncertainty-aware forecasting DQN achieves convergence approximately 50% faster than the standard DQN (though the standard DQN trajectory is not shown in this specific figure, it's inferred from the text comparison). It reaches high performance around 600,000 steps, requiring about 25% fewer time steps. This efficiency is attributed to the probabilistic forecasts which allow the model to anticipate future states, thus narrowing the exploration space and guiding agents towards optimal policies more effectively.
  • Stability: The reward curves show stable convergence, indicating that the agents successfully learn effective energy management strategies.
  • Scalability: The modular and parallelizable environment, with a linear-complexity auction and communication-free agents, suggests the approach is scalable for larger, distributed systems, addressing a common concern in MARL applications to real-world microgrids.

6.1.2. Battery Management

Figure 4 presents the average daily battery State of Charge (SoC), load, and generation over a year.

Figure 4. Daily battery SoC, load & generation (year average): average battery percentage (blue), average load (pink), and average generation (orange) over a 24-hour day.

  • Anticipatory Behavior: The figure shows that battery percentage (blue line) steadily rises from early morning, peaking in the late afternoon. This correlates with the average generation (orange line) pattern, indicating that agents are coordinating charging during periods of high renewable generation (e.g., PV output during sunny hours).
  • Strategic Discharge: The battery SoC then declines as stored energy is dispatched to meet evening loads (pink line), particularly before and during peak hours. This pre-charging and strategic discharging behavior reduces reliance on the utility grid during peak tariff hours, contributing to cost-effectiveness and lower carbon emissions associated with peak hour generation.
  • Improved Utilization: This anticipatory behavior, informed by uncertainty-aware forecasts, contrasts with potentially more reactive strategies of standard DQN or rule-based methods. It leads to more effective and community-beneficial storage utilization.

6.1.3. Key Performance Indicators (KPIs)

The following are the results from Table 1 of the original paper:

| Metric | Scenario | Rule Based | RB+QL | DQN | DQN Forecasting | % Diff (DQN vs DQN Forecasting) |
| --- | --- | --- | --- | --- | --- | --- |
| Electricity Cost (Bought) (€) | w/o P2P | 125400 | 121300 | 105000 | 99100 | -5.7% |
| | with P2P | 119500 | 116800 | 102100 | 96800 | -3.2% |
| | P2P vs w/o P2P (%) | -4.7% | -3.7% | -2.8% | -2.9% | |
| Electricity Revenue (Sold) (€) | w/o P2P | 3600 | 3800 | 7850 | 8350 | +6.4% |
| | with P2P | 4400 | 4650 | 14450 | 20900 | +44.7% |
| | P2P vs w/o P2P (%) | +22.2% | +22.4% | +84.1% | +150.1% | |
| Peak Hour Demand (kW) | w/o P2P | 36000 | 34500 | 23200 | 14200 | -38.8% |
| | with P2P | 28500 | 26600 | 21850 | 11900 | -45.6% |
| | P2P vs w/o P2P (%) | -20.8% | -22.9% | -5.8% | -16.2% | |

Analysis of KPIs:

  1. Electricity Cost (Bought):

    • Without P2P: DQN Forecasting reduces costs by 5.7% (€105,000 to €99,100) compared to standard DQN.
    • With P2P: DQN Forecasting reduces costs by 3.2% (€102,100 to €96,800) compared to standard DQN.
    • Impact of P2P: P2P trading itself significantly reduces costs across all models (e.g., standard DQN from €105,000 to €102,100). The combination of DQN Forecasting with P2P yields the lowest absolute cost (€96,800).
    • Overall: The uncertainty-aware approach consistently lowers energy purchase costs, showing the economic benefit of risk-informed decisions.
  2. Electricity Revenue (Sold):

    • Without P2P: DQN Forecasting increases revenue by 6.4% (€7,850 to €8,350) compared to standard DQN.
    • With P2P: DQN Forecasting achieves a remarkable 44.7% increase (€14,450 to €20,900) compared to standard DQN.
    • Impact of P2P: P2P trading has a massive positive impact on revenue (e.g., standard DQN revenue jumps by 84.1% from €7,850 to €14,450). When P2P is enabled, the uncertainty-aware model allows agents to exploit selling opportunities far more effectively, resulting in the highest revenue (€20,900).
    • Overall: The most significant gain is in sales revenue, especially when P2P trading is active, indicating that uncertainty-aware agents are better at identifying and leveraging surplus energy for profit within the community.
  3. Peak Hour Demand (kW):

    • Without P2P: DQN Forecasting reduces peak hour grid demand by 38.8% (from 23,200 kW to 14,200 kW) compared to standard DQN.
    • With P2P: DQN Forecasting achieves an even greater reduction of 45.6% (from 21,850 kW to 11,900 kW) compared to standard DQN.
    • Impact of P2P: P2P trading also contributes to peak demand reduction (e.g., standard DQN reduces it by 5.8% from 23,200 kW to 21,850 kW). The combination with uncertainty-aware forecasting results in the lowest peak demand (11,900 kW), demonstrating strong demand-side management and grid-stress mitigation.
    • Overall: This metric highlights the operational benefits for the grid and community, showing that uncertainty-aware models, especially with P2P, can significantly flatten the peak load profile.

6.1.4. Comparison with Baselines

  • Rule-based and RB+QL methods show incremental improvements but lack the adaptability and predictive power of DQN-based approaches.
  • Standard DQN provides a substantial improvement over rule-based methods, demonstrating the power of RL in this domain.
  • However, DQN Forecasting (the proposed model) consistently outperforms standard DQN across all KPIs, both with and without P2P trading. This validates the core hypothesis that uncertainty-aware forecasting enhances RL performance.
  • The improvements are "even more pronounced when P2P trading is enabled," underscoring the synergy between advanced forecasting and market mechanisms.

6.2. Ablation Studies / Parameter Analysis

While the paper doesn't present formal ablation studies in a dedicated section, the comparison between "DQN" and "DQN Forecasting" effectively serves as an ablation on the uncertainty-aware forecasting component. The DQN baseline represents a DQN model that likely uses deterministic forecasts (or simpler forecasting methods), while DQN Forecasting integrates the KTU model's probabilistic forecasts. The consistent and significant performance gains of DQN Forecasting over DQN across all metrics (up to 44.7% revenue increase, 45.6% peak demand reduction) strongly validate the effectiveness of incorporating uncertainty-aware forecasting into the MARL framework.

The automated hyperparameter optimization via Optuna also indicates a systematic approach to finding optimal parameters for the KTU model, which contributes to its performance.
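Since the paper does not detail the search space, the following is only a hedged sketch of how an Optuna study for the KTU forecaster's hyperparameters could be set up; the parameters, their ranges, and the placeholder objective are hypothetical stand-ins.

```python
import optuna


def train_and_validate(params: dict) -> float:
    # Placeholder standing in for fitting the KTU model and returning a
    # validation score such as CRPS (lower is better); purely synthetic here.
    return (params["lr"] * 100 - 0.05) ** 2 + params["dropout"] * 0.01


def objective(trial: optuna.Trial) -> float:
    params = {
        "lr": trial.suggest_float("lr", 1e-5, 1e-2, log=True),
        "num_layers": trial.suggest_int("num_layers", 1, 4),
        "num_heads": trial.suggest_categorical("num_heads", [2, 4, 8]),
        "dropout": trial.suggest_float("dropout", 0.0, 0.3),
    }
    return train_and_validate(params)


study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```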

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully introduces a novel framework for P2P energy trading that synergistically combines uncertainty-aware knowledge transformer forecasting (KTU) with multi-agent deep reinforcement learning (MARL DQN). By explicitly quantifying and integrating prediction uncertainty into the MARL state space and reward functions, the proposed approach enables agents to make risk-sensitive and anticipatory decisions in stochastic energy environments.

The key achievements include:

  • Faster Convergence: Approximately 50% faster MARL training convergence.

  • Efficient Battery Management: Smarter charging and discharging strategies based on forecasted generation and load and their uncertainty.

  • Significant Economic Benefits: Energy purchase costs reduced by up to 5.7% (without P2P) and 3.2% (with P2P). Electricity sales revenue increased by 44.7% (with P2P).

  • Operational Resilience: Peak hour grid demand reduced by up to 45.6% (with P2P).

    These results establish a new benchmark for resilient, efficient, and sustainable P2P energy trading systems, underscoring that uncertainty-aware learning is crucial for maximizing both economic and operational gains in decentralized energy communities.

7.2. Limitations & Future Work

The authors acknowledge several areas for future research:

  • Integration of Additional Market Mechanisms: Exploring more complex or novel market designs within the P2P framework.
  • Real-World Pilot Deployments: Moving beyond simulation to test the framework in actual microgrids or P2P communities to validate its performance and address practical challenges.
  • Optimization of the Forecasting Horizon: Investigating how different forecasting horizons (e.g., short-term vs. long-term) impact MARL decisions and overall system performance, and optimizing this parameter.
  • Theoretical Analysis or Guarantees: Conducting more rigorous theoretical analysis regarding the convergence properties of the uncertainty-aware MARL system, and potentially providing performance guarantees.

7.3. Personal Insights & Critique

This paper presents a highly relevant and innovative contribution to the field of P2P energy trading. The explicit focus on uncertainty quantification and its direct integration into MARL decision-making is a significant step forward, addressing a critical weakness in much of the existing literature. The combination of Transformer architectures for forecasting with DQN for control, coupled with domain-specific regularization and reward functions, demonstrates a well-engineered and practical solution.

Inspirations and Applications:

  • Robustness in Stochastic Systems: The methodology for incorporating uncertainty estimates into the state space and reward function could be applied to other domains characterized by high stochasticity and sequential decision-making, such as supply chain management, autonomous driving (e.g., predicting other agents' uncertain movements), or financial trading.
  • Proactive Resource Management: The anticipatory battery management (pre-charging for peak hours) learned by agents highlights the power of probabilistic forecasts to enable proactive rather than reactive resource optimization. This could be beneficial for any resource management problem where future supply/demand is uncertain but critical.
  • Synergy of AI Components: The paper effectively demonstrates the synergy between advanced deep learning models (for forecasting) and reinforcement learning (for control), showing how combining specialized AI components can lead to superior overall system performance.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Complexity of Reward Function: While detailed, the reward functions are quite complex and hand-engineered. Their optimality is assumed based on improved KPIs. In real-world scenarios, designing such intricate reward functions can be challenging and might require extensive domain expertise or iterative tuning. Inverse Reinforcement Learning could be an alternative to infer rewards from desired expert behavior.

  • Scalability of Double Auction: The paper mentions that the double auction has linear complexity and is scalable. However, for very large numbers of prosumers (e.g., thousands), a centralized auctioneer might still face computational or communication bottlenecks. Distributed P2P market clearing mechanisms or blockchain-based solutions might need to be explored further to maintain true decentralization and scalability in future work.

  • Generalizability of Uncertainty Model: The KTU model's effectiveness relies on the quality of its uncertainty quantification. While PICP, MPIW, and CRPS are good metrics, their performance for very rare or extreme events (e.g., sudden equipment failures, extreme weather events) might need further investigation. The Gaussian assumption for load and PV distribution might not always hold perfectly for all energy phenomena.

  • Real-world Implementation Challenges: Moving from simulation to real-world deployment (as suggested for future work) will introduce practical challenges such as data privacy concerns, communication latency, hardware limitations for DER control, and regulatory frameworks. The current model relies on a centralized auctioneer, which may raise privacy concerns despite sharing only aggregated data. Fully decentralized market mechanisms could be a more robust approach in practice.

  • Interpretability of DQN Actions: While the results show improved performance, the exact policy learned by the DQN agents (e.g., why a specific buy or sell action was chosen at a precise moment given the uncertainty) might still lack full interpretability. For critical infrastructure like energy grids, explainable AI is gaining importance.

  • Computational Cost: Training Transformer-based models and MARL DQN systems can be computationally intensive. The paper mentions faster convergence, but the absolute training time and resource requirements could still be substantial, especially for scaling to larger communities or more complex environments.

    Overall, this paper makes a compelling case for the necessity and benefits of uncertainty-aware MARL in P2P energy trading, offering a robust framework that pushes the boundaries of current research.
