
A survey on physics informed reinforcement learning: Review and open problems


TL;DR Summary

This paper reviews the emerging field of Physics-Informed Reinforcement Learning (PIRL), introducing a novel taxonomy based on the RL pipeline to better understand current methods and identify key challenges, and highlighting the potential of enhancing RL algorithms' physical plausibility, precision, data efficiency, and real-world applicability.

Abstract

The fusion of physical information in machine learning frameworks has revolutionized many application areas. This work explores their utility for reinforcement learning applications. A thorough review of the literature on the fusion of physics information in reinforcement learning approaches, commonly referred to as physics-informed reinforcement learning (PIRL), is presented. A novel taxonomy is introduced with the reinforcement learning pipeline as the backbone to classify existing works, compare and contrast them, and derive crucial insights. Existing works are analyzed with regard to the representation/form of the governing physics modeled for integration, their specific contribution to the typical reinforcement learning architecture, and their connection to the underlying reinforcement learning pipeline stages. Core learning architectures and physics incorporation biases of existing PIRL approaches are identified and used to further categorize the works for better understanding and adaptation. By providing a comprehensive perspective on the implementation of the physics-informed capability, the taxonomy presents a cohesive approach to PIRL. It identifies the areas where this approach has been applied, as well as the gaps and opportunities that exist. Additionally, the review highlights unresolved issues and challenges, while also incorporating potential and emerging solutions to guide future research. This nascent field holds great potential for enhancing reinforcement learning algorithms by increasing their physical plausibility, precision, data efficiency, and applicability in real-world scenarios.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

A survey on physics informed reinforcement learning: Review and open problems

1.2. Authors

  • Chayan Banerjee (Queensland University of Technology, Brisbane, Australia)
  • Kien Nguyen (Queensland University of Technology, Brisbane, Australia)
  • Clinton Fookes (Queensland University of Technology, Brisbane, Australia)
  • Maziar Raissi (University of Colorado, Boulder, USA)

1.3. Journal/Conference

This paper is a review article. The available metadata lists a publication date of 1 January 2023. Given its scope as a comprehensive survey, it would typically appear in a reputable journal focusing on machine learning, artificial intelligence, or control systems. While the specific journal name is not stated in the metadata, the authors' affiliations and the depth of the review suggest a high-impact venue. Maziar Raissi is notably associated with Physics-Informed Neural Networks (PINNs), a foundational concept in the field, which lends significant credibility to the work.

1.4. Publication Year

2023

1.5. Abstract

This work surveys the emerging field of Physics-Informed Reinforcement Learning (PIRL), which integrates physical information into Reinforcement Learning (RL) frameworks. The paper introduces a novel taxonomy, structured around the RL pipeline, to categorize existing PIRL approaches based on the form of physics modeled, its contribution to RL architectures, and its connection to RL pipeline stages. It identifies core learning architectures and physics incorporation biases, providing a cohesive understanding of PIRL implementations. The review highlights current applications, research gaps, challenges, and potential solutions, emphasizing PIRL's potential to enhance RL algorithms in terms of physical plausibility, precision, data efficiency, and real-world applicability.

/files/papers/6919fecd110b75dcc59ae34d/paper.pdf (publication status: likely an officially published paper, given the formal structure and affiliations, though a direct journal link is not provided).

2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper is the inherent limitations of conventional, purely data-driven Reinforcement Learning (RL) algorithms when applied to complex real-world systems. While RL has achieved impressive results in simulations, it often struggles with real-world data due to several challenges:

  • Sample Efficiency: RL typically requires a vast amount of interaction data, which is often expensive or unsafe to collect in real-world environments.

  • High-Dimensionality: Real-world state and action spaces are frequently continuous and high-dimensional, making learning difficult.

  • Safe Exploration: Unconstrained trial-and-error exploration can lead to dangerous or physically implausible states in safety-critical applications.

  • Disconnection between Simulation and Reality (Sim-to-Real gap): Models trained in simulation often perform poorly when deployed in the physical world due to discrepancies.

  • Under-defined Reward Functions: Designing effective reward functions that accurately guide RL agents in complex tasks is challenging.

    The importance of this problem stems from the increasing demand for RL solutions in domains like autonomous driving, robotics, and energy systems, where physical plausibility, safety, and efficiency are paramount.

The paper's entry point is the recognition that incorporating mathematical physics into machine learning models, a paradigm known as Physics-Informed Machine Learning (PIML), has revolutionized many application areas. PIML allows neural networks to learn more efficiently from incomplete data by embedding physical laws, leading to faster training, better generalization, and physically sound solutions. Given that most RL-based solutions deal with real-world problems with inherent physical structures, RL is a natural candidate for PIML. The innovative idea is to comprehensively review and systematize the fusion of physical information into RL, termed Physics-Informed Reinforcement Learning (PIRL).

2.2. Main Contributions / Findings

The paper makes several primary contributions to the field of PIRL:

  1. Unified Taxonomy: It proposes a novel and comprehensive taxonomy for PIRL, structured around the Reinforcement Learning (RL) pipeline. This taxonomy classifies works based on:

    • The type of physics information modeled (e.g., differential equations, barrier certificates, physics parameters, simulators).
    • The PIRL methods used for augmentation (e.g., state design, action regulation, reward design, augmenting policy/value networks, augmenting simulators).
    • The stages of the RL pipeline where physics is integrated (e.g., problem representation, learning strategy, network design, training, deployment).
  2. Additional Categorizations: It introduces two additional categories, Learning Architecture and Bias (e.g., observational, learning, inductive), to provide a more precise understanding of how physics is implemented in PIRL approaches.

  3. Algorithmic Review: It provides a state-of-the-art review of PIRL methods, using unified notations and functional diagrams to explain the latest literature across diverse domains.

  4. Training and Evaluation Benchmark Review: It analyzes the evaluation benchmarks used in PIRL literature, identifying popular platforms and trends for easy reference.

  5. In-depth Analysis: It delves into how physics information is integrated into specific RL approaches, the types of physical processes modeled, and the network architectures utilized.

  6. Identification of Open Problems: It summarizes challenges, unresolved issues, and future research directions (e.g., high-dimensional spaces, safety in complex environments, choice of physics prior, benchmarking, scalability).

    The key conclusions and findings are that PIRL holds immense potential to address major RL challenges by:

  • Increasing Physical Plausibility: Ensuring learned policies adhere to fundamental physical laws.

  • Improving Precision and Generalization: Leading to more accurate and robust solutions that transfer better to unseen scenarios.

  • Enhancing Data Efficiency: Reducing the reliance on vast amounts of real-world data by leveraging physics knowledge.

  • Expanding Real-World Applicability: Making RL more viable for safety-critical and complex dynamic systems.

    These findings collectively provide a structured framework for understanding, developing, and evaluating PIRL approaches, solving the problem of a fragmented and unorganized research landscape in this nascent field.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a foundational understanding of Reinforcement Learning (RL) and Physics-Informed Machine Learning (PIML) is essential.

3.1.1. Reinforcement Learning (RL)

Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make decisions by interacting with an environment to achieve a goal. The learning process is driven by rewards received from the environment.

  • Agent: The learner or decision-maker.

  • Environment: Everything outside the agent, with which it interacts.

  • State ($s$): A representation of the current situation of the environment observed by the agent.

  • Action ($a$): A decision made by the agent that affects the environment.

  • Reward ($r$): A scalar feedback signal from the environment, indicating the desirability of the agent's action in a given state. The agent's goal is to maximize the cumulative reward over time.

  • Policy ($\pi$): A strategy that maps observed states to actions. An optimal policy $\pi^*$ maximizes the expected cumulative reward.

  • Value Function: A prediction of the future reward an agent can expect from a given state (or state-action pair).

    The interaction typically follows a Markov Decision Process (MDP) framework.

3.1.2. Markov Decision Process (MDP)

A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making problems, which forms the backbone of most RL problems. It is formally defined by a tuple $(S, \mathcal{A}, R, P, \gamma)$:

  • $S$: A set of possible states of the environment.

  • $\mathcal{A}$: A set of possible actions the agent can take.

  • $R(s_{t+1}, a_t)$: The reward function, which defines the scalar reward received after taking action $a_t$ from state $s_t$ and transitioning to state $s_{t+1}$.

  • $P(s_{t+1} \mid s_t, a_t)$: The transition probability function (or environment model), which describes the probability of transitioning to state $s_{t+1}$ from state $s_t$ after taking action $a_t$. The "Markov" property means that the future state depends only on the current state and action, not on the entire history of states and actions.

  • $\gamma \in [0, 1]$: The discount factor, which determines the present value of future rewards. A higher $\gamma$ values future rewards more heavily.

    The agent's objective in an MDP is to find an optimal policy $\pi^*$ that maximizes the expected cumulative discounted reward over a trajectory $\tau = (s_1, a_1, s_2, a_2, \dots, s_T, a_T, s_{T+1})$. The probability of a trajectory $\tau$ under a policy $\pi_{\phi}$ is given by: $ p_{\phi}(\tau) = p(s_1) \prod_{t=1}^{T} \pi_{\phi}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) $ where $\phi$ represents the policy parameters.

The objective function to maximize is: $ \phi^* = \arg\max_{\phi} \underbrace{\mathbf{E}_{\tau \sim p_{\phi}(\tau)} \Big[ \sum_{t=1}^{T} \gamma^t R(a_t, s_{t+1}) \Big]}_{\mathcal{I}(\phi)} $ where $\mathcal{I}(\phi)$ is the expected return, and $\gamma$ is the discount factor.
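To make the objective $\mathcal{I}(\phi)$ concrete, the following minimal sketch samples trajectories from a toy two-state MDP under a fixed stochastic policy and estimates the expected discounted return by Monte Carlo. The transition, reward, and policy tables are hypothetical numbers chosen only for illustration.

```python
import numpy as np

# Toy MDP (hypothetical numbers): 2 states, 2 actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']: transition probabilities
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                  # R[a, s']: reward for landing in s' after action a
              [0.5, 2.0]])
pi = np.array([[0.8, 0.2],                 # pi[s, a]: fixed stochastic policy
               [0.3, 0.7]])
gamma, T = 0.95, 50
rng = np.random.default_rng(0)

def rollout():
    """Sample one trajectory and return its discounted return sum_t gamma^t R(a_t, s_{t+1})."""
    s, G = 0, 0.0
    for t in range(T):
        a = rng.choice(2, p=pi[s])
        s_next = rng.choice(2, p=P[s, a])
        G += (gamma ** t) * R[a, s_next]
        s = s_next
    return G

# Monte Carlo estimate of the expected return I(phi) under this policy.
returns = [rollout() for _ in range(2000)]
print("Estimated expected discounted return:", np.mean(returns))
```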

3.1.3. Model-Free Reinforcement Learning (MFRL)

In Model-Free Reinforcement Learning (MFRL), the agent does not explicitly learn or have access to the environment model $P(s_{t+1} \mid s_t, a_t)$. Instead, it learns directly from interactions (experiences) with the environment. Examples include Q-learning, SARSA, Policy Gradient methods (e.g., REINFORCE, Actor-Critic), Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC). These methods are typically simpler to implement but often require a large number of samples for effective learning.

3.1.4. Model-Based Reinforcement Learning (MBRL)

In Model-Based Reinforcement Learning (MBRL), the agent attempts to learn or is provided with an explicit representation of the environment model $P(s_{t+1} \mid s_t, a_t)$. This model can then be used for planning, simulating future outcomes, and generating synthetic data, which can significantly improve sample efficiency. However, learning an accurate model can be challenging, and any inaccuracies in the learned model can lead to suboptimal policies or poor performance in the real environment. Examples include Dyna architectures and Model Predictive Control (MPC) with learned models.
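The following minimal Dyna-style sketch (a hypothetical discrete task, not from the paper) shows the two ideas side by side: the model-free Q-learning update uses only the observed transition, while the model-based planning step replays transitions from a learned table of the environment model.

```python
import numpy as np
import random

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
model = {}                      # learned model: (s, a) -> (r, s') from experience
alpha, gamma = 0.1, 0.95

def env_step(s, a):
    """Hypothetical environment dynamics, for illustration only."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return r, s_next

s = 0
for step in range(500):
    a = random.randrange(n_actions) if random.random() < 0.1 else int(Q[s].argmax())
    r, s_next = env_step(s, a)

    # Model-free update: learn directly from the real transition.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

    # Model-based part: store the transition, then plan with simulated replay.
    model[(s, a)] = (r, s_next)
    for _ in range(10):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        Q[ps, pa] += alpha * (pr + gamma * Q[ps_next].max() - Q[ps, pa])

    s = 0 if s_next == n_states - 1 else s_next

print(Q)
```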

3.1.5. Physics-Informed Machine Learning (PIML)

Physics-Informed Machine Learning (PIML) refers to a class of machine learning methods, often involving neural networks, that integrate scientific knowledge (e.g., physical laws, differential equations, conservation principles) directly into the learning process. This integration can happen through various mechanisms, such as modifying the loss function with physics-based penalties, designing specialized network architectures that inherently respect physical laws, or using physics models to generate or augment training data. The benefits of PIML include:

  • Physical Consistency: Solutions adhere to fundamental physical laws.
  • Data Efficiency: Less data is needed for training as physics provides a strong prior.
  • Improved Generalization: Models generalize better to unseen scenarios or regions with sparse data.
  • Enhanced Interpretability: The physical basis can make models more transparent.
  • Faster Training: Physics constraints can guide the optimization process more effectively.

A prominent example is Physics-Informed Neural Networks (PINNs), which embed Partial Differential Equations (PDEs) into the neural network's loss function.
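As an illustrative instance of the soft physics penalty behind PINNs, the sketch below (assuming PyTorch) fits a small network to sparse observations of a decaying signal while penalizing the residual of an assumed ODE $\dot{u} = -k u$; the ODE, its coefficient, and the synthetic data are hypothetical and only meant to show the loss structure.

```python
import torch

torch.manual_seed(0)
k = 1.5                                   # assumed ODE coefficient: du/dt = -k * u
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# Sparse "measurements" of the true solution u(t) = exp(-k t) (synthetic, for illustration).
t_data = torch.tensor([[0.0], [0.5], [1.0]])
u_data = torch.exp(-k * t_data)
t_phys = torch.linspace(0.0, 2.0, 50).reshape(-1, 1).requires_grad_(True)

for it in range(2000):
    opt.zero_grad()
    data_loss = ((net(t_data) - u_data) ** 2).mean()            # fit the few observations
    u = net(t_phys)
    du_dt = torch.autograd.grad(u.sum(), t_phys, create_graph=True)[0]
    phys_loss = ((du_dt + k * u) ** 2).mean()                   # soft penalty on the ODE residual
    (data_loss + phys_loss).backward()
    opt.step()

print(float(data_loss), float(phys_loss))
```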

3.2. Previous Works

The paper differentiates itself from other surveys by offering a focused and comprehensive review specifically on RL approaches that utilize the structure, properties, or constraints unique to the underlying physics of a process/system.

  • Karniadakis et al. (2021): Provided a comprehensive review on Machine Learning (ML) in the context of physics-informed (PI) methods but did not specifically discuss RL domain approaches. This paper fills that gap.
  • Hao et al. (2022): Offered an overview of physics-informed machine learning, briefly touching upon PIRL. This current survey provides a much deeper dive into the RL aspect.
  • Eßer et al. (2022): Showcased the use of prior knowledge to guide RL algorithms, specifically for robotic applications. They categorized knowledge into expert, world, and scientific. In contrast, the current paper's scope of application domains is broader, extending beyond robotics to motion control, molecular structure optimization, safe exploration, and robot manipulation, and introduces a novel taxonomy specifically for PIRL.

3.3. Technological Evolution

The evolution of Physics-Informed Reinforcement Learning (PIRL) can be traced through the broader trends in AI, ML, and control theory:

  1. Early RL (Pre-2010s): Focused on tabular methods or linear function approximation, often in discrete, low-dimensional environments. The integration of domain knowledge was typically manual and heuristic-based.
  2. Rise of Deep RL (2012 onwards): With the advent of Deep Learning (DL) (e.g., Deep Q-Networks (DQN), AlphaGo), RL gained the ability to handle high-dimensional sensory inputs (like images) and complex control tasks. However, these methods were often purely data-driven, leading to issues with sample efficiency, safety, and physical plausibility in real-world scenarios.
  3. Emergence of PIML (Mid-2010s onwards): Researchers began explicitly embedding physical laws into ML models (e.g., PINNs by Raissi et al., 2019). This addressed limitations of purely data-driven models, particularly in scientific computing and engineering, by ensuring physical consistency and improving generalization with limited data.
  4. PIRL as a Converging Field (Late 2010s-Present): The natural synergy between PIML and RL led to the birth of PIRL. The realization that RL agents operating in physical environments could greatly benefit from respecting underlying physics spurred research into integrating physics into various components of the RL pipeline:
    • Safe RL: Incorporating control theory concepts like Control Barrier Functions (CBFs) to guarantee safety.

    • Model Learning: Using physics to build more accurate and generalizable world models in MBRL.

    • Reward Shaping: Designing rewards that naturally encourage physically sound behavior.

    • State Representation: Creating physically informed state features to reduce dimensionality and improve interpretability.

    • Network Architectures: Developing neural network designs that inherently encode physical symmetries or dynamics.

      This paper's work fits into this timeline by consolidating the diverse approaches developed within PIRL, providing a structured overview, and identifying future directions for this rapidly growing field.

3.4. Differentiation Analysis

Compared to related surveys, this paper's core differences and innovations lie in its specific focus and the depth of its analytical framework for PIRL:

  1. Exclusive Focus on PIRL: Unlike broader PIML surveys (e.g., Karniadakis et al., 2021; Hao et al., 2022) that cover various ML paradigms (supervised, unsupervised, RL), this paper exclusively focuses on Reinforcement Learning. This allows for a much more granular and tailored analysis of how physics interacts with RL-specific challenges and components.

  2. Novel Taxonomy based on RL Pipeline: The most significant innovation is the introduction of a new, unified taxonomy. This framework dissects PIRL methods not just by what physics is used, but how it's integrated (the specific PIRL method) and where in the RL pipeline this integration occurs (problem representation, learning strategy, network design, training, deployment). This provides a holistic, structured view that helps researchers understand the complete integration landscape.

  3. Comprehensive Categorization of Physics Information: The paper meticulously categorizes the types of physics information (e.g., DAE, BPC, PPV, ODR, PS, PPR). This detailed breakdown is crucial for understanding the diverse ways physical knowledge can be leveraged.

  4. Detailed PIRL Methodologies and Learning Architectures: It provides an in-depth analysis of specific PIRL methods (e.g., state design, action regulation, reward design) and introduces novel Learning Architecture categories (e.g., safety filter, physics embedded network, differentiable simulator). This architectural perspective offers practical insights into implementation choices.

  5. Application Domain Breadth: While some prior works might focus on specific domains like robotics (e.g., Eßer et al., 2022), this paper's review spans a much wider range of PIRL applications, from motion control and molecular optimization to safe exploration and energy management, demonstrating the general applicability of PIRL principles.

  6. Identification of Physics Incorporation Biases: By mapping PIRL approaches to observational, learning, and inductive biases from the broader PIML literature, the paper establishes a clear connection to established PIML paradigms, enriching the theoretical understanding of PIRL.

    In essence, this paper provides a much-needed organizational framework and a granular, multi-faceted analysis of PIRL, making it a definitive guide for both newcomers and experienced researchers in this interdisciplinary field.

4. Methodology

This section describes the methodology employed by the survey paper itself to review and classify the Physics-Informed Reinforcement Learning (PIRL) literature. The core methodology revolves around introducing a novel, unified taxonomy and two additional categorizations to analyze existing works comprehensively.

4.1. Principles

The core idea behind the paper's methodology is to systematically organize the burgeoning field of PIRL by introducing a principled framework for understanding, comparing, and contrasting diverse approaches. The theoretical basis is that by deconstructing PIRL methods according to the nature of the physics knowledge, the specific RL components affected, and their placement within the RL pipeline, a clearer picture of the field's landscape, innovations, and gaps can emerge. The intuition is that PIRL is a multidisciplinary field, and a structured categorization can help in identifying common patterns, successful integration strategies, and areas for future research.

4.2. Core Methodology In-depth (Layer by Layer)

The paper's methodology for analyzing PIRL literature is built upon a novel taxonomy and two additional categorization schemes. The overall flow involves identifying relevant literature, extracting details about physics incorporation, and then classifying these details using the proposed framework.

4.2.1. Literature Search and Identification

The survey draws on high-quality sources, including Semantic Scholar, Google Scholar, IEEE Xplore, and SpringerLink. Keywords such as physics-informed, physics-aided, physics informed reinforcement learning, and physics priors were used to search for relevant peer-reviewed journals, conference papers, and technical reports. The volume of research identified over the years is illustrated in Figure 1 of the original paper, showing an exponential growth trend.

The following figure (Figure 1 from the original paper) illustrates the volume of research works identified in this process.

Fig. 1. PIRL papers published over the years. The bar chart illustrates the exponential growth of PIRL papers over the last seven years (2018-2023); the 2024* data point, marked with an asterisk, indicates the continuing upward trend.

4.2.2. Reinforcement Learning (RL) Fundamentals

The paper first establishes the fundamental RL framework, which serves as the backbone for its taxonomy. It defines RL using the Agent-Environment paradigm and the Markov Decision Process (MDP).

The agent-environment RL framework is presented as follows (Figure 2 from the original paper):

(Schematic from the original paper: the structure and flow of the PIRL stages within the agent-environment RL framework, including the RL agent's safety filter, PI reward, system-dynamics embedding, and other related components, and the interaction between the RL model and the physics information.)

The MDP is formally represented by the tuple $(S, \mathcal{A}, R, P, \gamma)$, where:

  • $S$: Represents the set of states of the environment.

  • $\mathcal{A}$: Represents the set of actions that the RL agent can take.

  • $R(s_{t+1}, a_t)$: The reward function, which typically generates a reward due to the action-induced state transition from $s_t$ to $s_{t+1}$.

  • $P(s_{t+1} \mid s_t, a_t)$: The environment model that returns the probability of transitioning to state $s_{t+1}$ from $s_t$.

  • $\gamma \in [0, 1]$: The discount factor, which determines the emphasis given to immediate rewards relative to future rewards.

    The agent-environment interaction is recorded as a trajectory $\tau$, and the closed-loop trajectory distribution for an episode $t = 1, \dots, T$ is given by: $ p_{\phi}(\tau) = p_{\phi}(s_1, a_1, s_2, a_2, \dots, s_T, a_T, s_{T+1}) = p(s_1) \prod_{t=1}^{T} \pi_{\phi}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) $ where:

  • $\tau$: Represents the sequence of states and control actions $(s_1, a_1, s_2, a_2, \dots, s_T, a_T, s_{T+1})$.

  • $p(s_1)$: The initial state distribution.

  • $\pi_{\phi}(a_t \mid s_t)$: The agent's policy, which is a probability distribution over actions given a state, parameterized by $\phi$.

  • $p(s_{t+1} \mid s_t, a_t)$: The environment's dynamics, the probability of the next state given the current state and action.

    The objective is to find an optimal policy parameter $\phi^*$ that maximizes the expected cumulative discounted reward: $ \phi^* = \arg\max_{\phi} \underbrace{\mathbf{E}_{\tau \sim p_{\phi}(\tau)} \Big[ \sum_{t=1}^{T} \gamma^t R(a_t, s_{t+1}) \Big]}_{\mathcal{I}(\phi)} $ where:

  • $\mathcal{I}(\phi)$: The objective function representing the expected total discounted reward.

  • $\mathbf{E}_{\tau \sim p_{\phi}(\tau)}$: Expectation over trajectories sampled according to policy $\pi_{\phi}$.

    The paper also categorizes RL algorithms by model use (Model-free, Model-based) and interaction with the environment (Online, Off-policy, Offline).

The typical RL architectures are presented in Figure 3 from the original paper:

(Schematic from the original paper: the relationship between physics information types (PI types), PIRL methods, and the RL pipeline; arrows indicate where each method acts within the RL architecture.)

4.2.3. Definition and Intuitive Introduction to PIRL

Physics-informed RL (PIRL) is defined as the concept of incorporating physics structures, priors, and real-world physical variables into the policy learning or optimization process. The goal is to improve effectiveness, sample efficiency, and accelerated training for complex problem-solving and real-world deployment. The paper illustrates that physics priors can come in various forms, such as physical rules, mathematical equations, and physics simulators. An example of PIRL (Figure 5 from the original paper) shows how jam-avoiding distance is incorporated as a state space input in adaptive cruise control.
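As a minimal illustration of this kind of physics-based state design, the sketch below appends a physics-derived variable to a raw observation vector. The jam-avoiding distance here is a generic time-headway rule and the observation layout is hypothetical; it is not necessarily the exact formulation used by Jurj et al. (2021).

```python
import numpy as np

def jam_avoiding_distance(v_ego, time_headway=1.5, standstill_gap=2.0):
    """Physics-derived safe following distance: a standstill gap plus a speed-dependent term."""
    return standstill_gap + time_headway * v_ego

def augment_state(raw_obs):
    """Append the physics-derived variable to the raw observation.

    raw_obs is assumed to be [ego_speed, lead_speed, gap]; this layout is hypothetical.
    """
    v_ego = raw_obs[0]
    return np.concatenate([raw_obs, [jam_avoiding_distance(v_ego)]])

print(augment_state(np.array([20.0, 18.0, 35.0])))   # original features plus d_jam = 32.0
```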

The map of physics incorporation in the conventional RL framework is illustrated in Figure 4 from the original paper:

Fig. 10. Example of state design through physics incorporation: a distributed control framework for connected automated vehicles (CAVs) (Shi et al., 2023). Information received from downstream vehicles and the roadway (e.g., gap, velocity, curvature) is used to design the DRL state $s_t^l$ for two-dimensional car-following control, with communication signals propagated across time steps.

The following figure (Figure 5 from the original paper) provides an illustrative example of physics incorporation:

Fig. 11. Example of action regulation using physics priors. In Zhao et al. (2023), a barrier certification module receives the control actions generated by the RL control policy and refines them sequentially, using barrier certificates and control barrier functions to guarantee constraints on the dynamical system.

4.2.4. PIRL Taxonomy

The core of the paper's methodology is the proposed three-category taxonomy, visualized in Figure 9 from the original paper, which serves as a structured lens to analyze PIRL literature.

The PIRL taxonomy is visualized in Figure 9 from the original paper:

(Charts from the original paper: the number of PIRL works using each RL algorithm and the categorization of the surveyed methods, including model learning, data augmentation, and physics variables, together with their application areas and open opportunities and challenges for future research.)

4.2.4.1. Physics Information (types): Representation of Physics Priors

This category classifies PIRL works based on the form in which physics knowledge is represented.

  1. Differential and Algebraic Equations (DAE):

    • Description: Many works use system dynamics represented by Partial Differential Equations (PDEs) or Ordinary Differential Equations (ODEs) and boundary conditions (BCs). This physics information is often integrated through Physics-Informed Neural Networks (PINNs) or other specialized networks.
    • Example: In transient voltage control (Gao et al., 2022), a PINN learns a physical constraint from PDEs and transfers it as a loss term to the RL algorithm.
  2. Barrier Certificate and Physical Constraints (BPC):

    • Description: Used to regulate agent exploration in safety-critical RL applications. This involves concepts from control theory like Control Lyapunov Function (CLF), Barrier Function (BF), Control Barrier Function (CBF), or Control Barrier Certificate (CBC). These functions typically define a safe set of states, and a control law is devised to keep the system within this set. They can be represented as Neural Networks (NNs) and learned via data-driven approaches. Physical constraints incorporated into the RL loss also fall here.
    • Example: Neural Barrier Certificate (NBC) (Zhao et al., 2023) is learned to filter actions and ensure safety.
    • Formula: The condition for learning an NBC $B_{\epsilon}(x)$ and the filtered control action $\mathcal{F}_u^{\psi}$ is given as: $ (\forall x \in S_0,\ B_{\epsilon}(x) \leq 0) \wedge (\forall x \in S_u,\ B_{\epsilon}(x) > 0) \wedge (\forall x \in \{x \mid B_{\epsilon}(x) = 0\},\ \mathcal{L}_{f(x, u_{RL})} B_{\epsilon}(x) < 0) $ where:
      • $x$: The state vector.
      • $S_0$: The set of initial states.
      • $S_u$: The set of unsafe states.
      • $B_{\epsilon}(x)$: The Neural Barrier Certificate, parameterized by $\epsilon$, which defines the safe region (if $B_{\epsilon}(x) \leq 0$, the state is safe).
      • $\mathcal{L}_{f(x, u_{RL})} B_{\epsilon}(x)$: The Lie derivative of $B_{\epsilon}(x)$ along the vector field $f(x, u_{RL})$, representing the rate of change of $B_{\epsilon}(x)$ under the system dynamics $f$ with control input $u_{RL}$.
      • $u_{RL}$: The action proposed by the RL agent.
      • The condition ensures that initial states are safe, unsafe states are outside the safe region, and when the system is on the boundary of the safe set ($B_{\epsilon}(x) = 0$), the control action drives it back into the safe set (making $B_{\epsilon}(x)$ decrease or stay constant). A minimal numerical sketch of such a safety-filter check appears after this list.
  3. Physics Parameters, Primitives and Physical Variables (PPV):

    • Description: Directly uses physics values extracted or derived from the environment or system. This includes physics parameters, Dynamic Movement Primitives (DMPs), physical states, or physical targets.
    • Example: In energy management (Li et al., 2023), the reward is designed to meet physical objectives like operation cost and self-energy sustainability. Jam-avoiding distance (Jurj et al., 2021) is a physical state input.
  4. Offline Data and Representation (ODR):

    • Description: Utilizes non-task-specific data collected from real systems (e.g., robots) in an offline setting, often alongside simulators, to improve sim-to-real transfer. It also includes learning physically relevant low-dimensional representations from observations.
    • Example: Hardware data collected from a real robot is used to seed simulators for training control policies (Lowrey et al., 2018). PINNs can extract physically relevant hidden state information (Gokhale et al., 2022).
  5. Physics Simulator and Model (PS):

    • Description: Simulators are used as test-beds or to impart physical correctness. This includes rigid body physics simulations and data-driven surrogate models that learn environment dynamics or enrich existing partial models.
    • Example: Rigid body physics simulations are used to solve for robot poses that closely follow motion capture clips (Chentanez et al., 2018). Lagrangian Neural Networks (LNNs) (Ramesh & Ravindran, 2023) can learn Lagrangian functions directly from interaction data.
    • Formula: For systems obeying Lagrangian mechanics, the Lagrangian $\mathcal{L}$ (a scalar quantity) is defined as: $ \mathcal{L}(q, \dot{q}, t) = \mathcal{T}(q, \dot{q}) - \mathcal{V}(q) $ where:
      • $q$: Generalized coordinates.
      • $\dot{q}$: Generalized velocities.
      • $t$: Time.
      • $\mathcal{T}(q, \dot{q})$: Kinetic energy of the system.
      • $\mathcal{V}(q)$: Potential energy of the system. The Lagrangian equation of motion (Euler-Lagrange equation) can be expressed as: $ \tau = M(q)\ddot{q} + C(q, \dot{q})\dot{q} + G(q), \quad \text{so that} \quad \ddot{q} = M^{-1}(q)\big(\tau - C(q, \dot{q})\dot{q} - G(q)\big) $ where:
      • $\tau$: External forces/torques acting on the system.
      • $M(q)$: Mass matrix (or inertia matrix).
      • $\ddot{q}$: Generalized accelerations.
      • $C(q, \dot{q})\dot{q}$: Coriolis and centrifugal terms.
      • $G(q)$: Gravitational terms.
      • $M^{-1}(q)$: Inverse of the mass matrix. In LNN implementations, separate networks might learn $\mathcal{V}(q)$ and the Lagrangian $\mathcal{L}$, from which the acceleration $\ddot{q}$ is derived.
  6. Physical Properties (PPR):

    • Description: Fundamental knowledge about the physical structure or properties of a system.
    • Example: System morphology (Xie et al., 2016) or system symmetry (Huang et al., 2023) can be used as priors.
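Referring back to the barrier-certificate condition under item 2 (BPC), the sketch below shows the shape of such a safety filter: a hypothetical barrier function $B(x)$ for a one-dimensional position constraint, a numerical check of the decrease condition, and a simple fallback action when the RL action would violate it. The dynamics, barrier, and fallback rule are illustrative assumptions, not the learned neural certificate of Zhao et al. (2023).

```python
def f(x, u, dt=0.05):
    """Hypothetical single-integrator dynamics: x_{t+1} = x_t + u * dt."""
    return x + u * dt

def barrier(x, x_max=1.0):
    """B(x) <= 0 inside the safe set {x <= x_max}, B(x) > 0 outside it."""
    return x - x_max

def safety_filter(x, u_rl, dt=0.05, u_safe=-1.0):
    """Return the RL action if it keeps B non-increasing near the boundary, else a fallback."""
    near_boundary = barrier(x) > -0.05
    dB = barrier(f(x, u_rl, dt)) - barrier(x)          # discrete-time proxy for the Lie derivative
    if near_boundary and dB >= 0.0:
        return u_safe                                   # override: push back into the safe set
    return u_rl

x = 0.99
for u_rl in [0.5, 0.5, -0.2]:                           # actions proposed by the RL policy
    u = safety_filter(x, u_rl)
    x = f(x, u)
    print(f"u_rl={u_rl:+.2f} -> applied u={u:+.2f}, x={x:.3f}, B(x)={barrier(x):+.3f}")
```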

4.2.4.2. PIRL Methods: Physics Prior Augmentations to RL

This category describes which components of the RL paradigm are directly modified or augmented by physics information.

  1. State design: Modifies or expands the state representation to make it more instructive.

    • Example: State fusion using additional information from the environment (Jurj et al., 2021) or other agents (Shi et al., 2023), extracted features from robust representations (Cao et al., 2023a), or learned surrogate model generated data (Han et al., 2022).
  2. Action regulation: Modifies the action value, often by imposing constraints to ensure safety.

    • Example: CBFs used to restrict actions (Cheng et al., 2019; Li et al., 2021).
  3. Reward design: Incorporates physics information through effective reward design or augmentation of existing reward functions with bonuses or penalties.

    • Example: Adding an intrinsic component to the reward based on physics (Dang & Ishii, 2022), or a parametric reward function for robotic gaits (Siekmann et al., 2021).
    • Formula: The updated reward function, as given in Siekmann et al. (2021) for periodic robot behavior, is defined as a biased sum of $n$ reward components $R_i(s, \phi)$: $ R(s, \phi) = \beta + \sum_i R_i(s, \phi) $ where each $R_i(s, \phi)$ is a product of a phase coefficient $c_i$, a phase indicator $I_i(\phi)$, and a phase reward measurement $q_i(s)$: $ R_i(s, \phi) = c_i \times I_i(\phi) \times q_i(s) $
      • $s$: Current state.
      • $\phi$: Cycle time variable, typically cycling over $[0, 1]$.
      • $\beta$: A bias term.
      • $R_i(s, \phi)$: Individual reward component for a specific phase.
      • $c_i$: Coefficient for the $i$-th phase.
      • $I_i(\phi)$: Indicator function, active (e.g., 1) when the system is in phase $i$.
      • $q_i(s)$: Measurement of the desired physical characteristic during phase $i$. A minimal sketch of this composition appears after this list.
  4. Augment policy or value N/W: Incorporates physics principles by adjusting update rules and losses of the policy or value function networks, or by making direct changes to their underlying network structure.

    • Example: Novel physics-based losses (Mora et al., 2021), constraints for learning (Gao et al., 2022), or embedding dynamical systems as differentiable layers in the policy network (Neural Dynamic Policies (NDP)) (Bahl et al., 2020).
    • Formula: Neural Dynamic Policies (NDP) define a policy $\pi$ as: $ \pi(a \mid s; \theta) \triangleq \Omega\big(DE(\Phi(s; \theta))\big), \quad \text{where} \quad DE(w, g) \to \{y, \dot{y}, \ddot{y}\} $ where:
      • $\pi(a \mid s; \theta)$: The policy, mapping state $s$ to action $a$, parameterized by $\theta$.
      • $\Omega(\cdot)$: An inverse controller that converts system states $(y, \dot{y}, \ddot{y})$ into actions $a$.
      • $DE(w, g)$: Represents the solution of a second-order differential equation (dynamical system) that outputs the system states $(y, \dot{y}, \ddot{y})$.
      • $\Phi(s; \theta)$: A neural network that takes the input state $s$ and predicts the parameters $(w, g)$ of the dynamical system.
      • $w$: Weights of the basis functions of the dynamical system.
      • $g$: A goal for the robot. The differential equation is typically $\ddot{y} = \alpha(\beta(g - y) - \dot{y}) + f(x)$, where $\alpha, \beta$ are global parameters chosen for critical damping, and $f$ is a non-linear forcing function.
  5. Augment simulator or model: Develops improved simulators or learnable models through physics incorporation for more accurate real-world environment simulation.

    • Example: Physics-based augmentation of DNN models for accurate system learning (Ramesh & Ravindran, 2023), improved simulators for sim-to-real transfer (Golemo et al., 2018), or physics-informed learning for partially known environment models (Liu & Wang, 2021).
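Referring back to the reward-design formula under item 3, the sketch below composes a periodic reward from phase indicators and phase measurements for a hypothetical two-phase (stance/swing) gait. The coefficients, phase windows, and measurements are illustrative stand-ins, not the exact terms used by Siekmann et al. (2021).

```python
import numpy as np

def phase_indicator(phi, lo, hi):
    """I_i(phi): 1 when the cycle variable phi (in [0, 1)) falls inside phase i's window."""
    return 1.0 if lo <= phi < hi else 0.0

def periodic_reward(state, phi, beta=0.1):
    """R(s, phi) = beta + sum_i c_i * I_i(phi) * q_i(s), with two illustrative phases."""
    foot_force, foot_velocity = state                    # hypothetical per-foot measurements
    phases = [
        # (coefficient c_i, phase window, measurement q_i(s))
        (+1.0, (0.0, 0.5), np.exp(-foot_velocity**2)),   # stance: penalize foot motion
        (+1.0, (0.5, 1.0), np.exp(-foot_force**2)),      # swing: penalize ground force
    ]
    return beta + sum(c * phase_indicator(phi, lo, hi) * q for c, (lo, hi), q in phases)

print(periodic_reward(state=(0.0, 0.8), phi=0.25))   # stance phase, moving foot -> lower reward
print(periodic_reward(state=(0.0, 0.0), phi=0.25))   # stance phase, still foot  -> higher reward
```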

4.2.4.3. RL Pipeline

This category maps PIRL implementations to the different functional stages of a typical RL pipeline.

  1. Problem representation: Modeling a real-world problem as an MDP, choosing observation vectors, defining reward functions, and specifying action spaces.
  2. Learning Strategy: Decisions regarding agent-environment interaction (e.g., model use, learning architecture, RL algorithm choice).
  3. Network design: Finer details of the learning framework, including constituent units, layer types, and network depth for policy and value function networks.
  4. Training: The stage where policy and allied networks are trained, including Sim-to-real approaches to reduce simulation-reality discrepancies.
  5. Trained policy deployment: The stage where the fully trained policy is deployed to solve the task.

4.2.5. Further Categorization

The paper introduces two additional categorizations to explain the implementation side of PIRL more precisely.

4.2.5.1. Bias (in integration of physics in RL)

These categories relate to how physics knowledge or priors are integrated into ML models, drawing from PIML literature:

  1. Observational bias: Uses multi-modal data reflecting physical principles. The DNN is trained directly on observed data to capture underlying physics.

    • Example: Using offline data to improve sim-to-real transfer (Golemo et al., 2018).
  2. Learning bias: Reinforces prior physics knowledge through soft penalty constraints (extra terms in the loss function based on physics, like momentum or conservation laws).

    • Example: PINNs embedding PDEs into the loss function (Karniadakis et al., 2021).
  3. Inductive bias: Incorporates prior knowledge as hard constraints through custom neural network architectures.

    • Example: Hamiltonian NN respecting conservation laws (Greydanus et al., 2019) or Lagrangian Neural Networks (LNNs) parameterizing arbitrary Lagrangians (Cranmer et al., 2020).
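As an illustration of inductive bias, the sketch below (assuming PyTorch; a toy stand-in inspired by Hamiltonian Neural Networks, Greydanus et al., 2019) parameterizes a scalar energy $H(q, p)$ and derives the dynamics from its gradients, so the Hamiltonian structure is built into the architecture rather than penalized in the loss.

```python
import torch

class HamiltonianDynamics(torch.nn.Module):
    """Learn a scalar H(q, p); the dynamics follow Hamilton's equations by construction."""
    def __init__(self, hidden=64):
        super().__init__()
        self.H = torch.nn.Sequential(torch.nn.Linear(2, hidden), torch.nn.Tanh(),
                                     torch.nn.Linear(hidden, 1))

    def forward(self, q, p):
        qp = torch.stack([q, p], dim=-1).requires_grad_(True)
        H = self.H(qp).sum()
        dH = torch.autograd.grad(H, qp, create_graph=True)[0]
        dH_dq, dH_dp = dH[..., 0], dH[..., 1]
        return dH_dp, -dH_dq        # dq/dt = dH/dp, dp/dt = -dH/dq

model = HamiltonianDynamics()
q, p = torch.randn(8), torch.randn(8)
dq_dt, dp_dt = model(q, p)
print(dq_dt.shape, dp_dt.shape)     # torch.Size([8]) torch.Size([8])
```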

4.2.5.2. Learning Architecture

This categorizes PIRL algorithms based on the alterations introduced to the conventional RL learning architecture. These are illustrated in Figure 8 from the original paper.

The learning architectures for PIRL are illustrated in Figure 8 from the original paper:

Fig. 14. Example of policy augmentation using physics information. In Bahl et al. (2020), given an observation $s_t$ from the environment, a neural dynamic policy $f_\theta$ generates the basis-function weights $w$ and the goal $g$; a forward integrator then uses these to produce a sequence of actions $a_t$ executed in the environment, while the robot collects the next state and reward to train the policy.
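The figure above and the NDP formulation in Section 4.2.4.2 can be summarized in a few lines: a network predicts the forcing weights $w$ and goal $g$, and a forward integrator rolls out the second-order system $\ddot{y} = \alpha(\beta(g - y) - \dot{y}) + f(x)$ to produce a trajectory. The sketch below is an illustrative simplification (radial-basis forcing term, Euler integration, fixed $w$ and $g$), not the exact implementation of Bahl et al. (2020).

```python
import numpy as np

def ndp_rollout(w, g, y0=0.0, alpha=25.0, beta=25.0 / 4.0, steps=100, dt=0.01):
    """Integrate y'' = alpha * (beta * (g - y) - y') + f(x) with an RBF forcing term."""
    centers = np.linspace(0.0, 1.0, len(w))
    y, y_dot, traj = y0, 0.0, []
    for k in range(steps):
        x = k / steps                                         # phase variable in [0, 1]
        basis = np.exp(-50.0 * (x - centers) ** 2)
        forcing = float(basis @ w) / (basis.sum() + 1e-8)
        y_ddot = alpha * (beta * (g - y) - y_dot) + forcing
        y_dot += y_ddot * dt
        y += y_dot * dt
        traj.append(y)
    return np.array(traj)

# In an NDP, w and g would come from a network Phi(s; theta); here they are fixed for illustration.
w = np.array([0.0, 2.0, -1.0, 0.5, 0.0])
trajectory = ndp_rollout(w, g=1.0)
print(trajectory[-1])    # the rollout converges toward the goal g
```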

  1. Safety filter: A PI-based module that regulates the agent's exploration, ensuring safety constraints.

    • Description: Takes the action $a_t$ from the RL agent $\pi_{\varphi}$ and the state $s_t$, and refines the action to $\tilde{a}_t$.
    • Example: Control Barrier Certificates (CBCs) (Zhao et al., 2023).
  2. PI reward: Physics information is used to modify the reward function.

    • Description: The PI-reward module augments the agent's extrinsic reward $r_t$ with a physics-based intrinsic component, giving $\tilde{r}_t$.
    • Example: Reward functions based on probabilistic periodic costs for robotic gaits (Siekmann et al., 2021).
  3. Residual learning: Consists of a physics-informed controller $\pi_{\psi}$ alongside a data-driven DNN-based policy $\pi_{\varphi}$ (the residual RL agent).

    • Description: The physics-informed controller provides a baseline action, and the RL agent learns to produce a residual correction.
    • Example: Combining MFRL and MBRL using CBFs (Cheng et al., 2019).
  4. Physics embedded network: Physics information (e.g., system dynamics) is directly incorporated into the policy or value function networks.

    • Example: Neural Dynamic Policies (NDP) embedding dynamical systems as differentiable layers (Bahl et al., 2020).
  5. Differentiable simulator: Uses differentiable physics simulators that explicitly provide loss gradients of simulation outcomes with respect to control actions.

    • Example: Used to compute analytic gradients of the policy's value function for monotonic improvement (Mora et al., 2021).
  6. Sim-to-real: The agent is trained in a simulator (source domain) and then transferred to a target domain (real world) for deployment, potentially with fine-tuning.

    • Example: Training a recurrent neural network on differences between simulated and actual robotic trajectories to improve the simulator (Golemo et al., 2018).
  7. Physics variable: Physical parameters, variables, or primitives are introduced to augment components (e.g., states and reward) of the RL framework.

    • Example: Jam-avoiding distance as an additional state variable (Jurj et al., 2021).
  8. Hierarchical RL: Includes hierarchical and curriculum learning-based approaches where physics is typically incorporated into meta and sub-policies and value networks.

    • Example: H-NDP learning local dynamical system-based policies and refining them globally (Bahl et al., 2021).
  9. Data augmentation: The input state is replaced with a different or augmented form, such as a low-dimensional representation to derive special and physically relevant features.

    • Example: Extracting physically relevant low-dimensional representations from observations using PINNs (Gokhale et al., 2022).
  10. PI Model Identification: In a data-driven MBRL setting, physics information is directly incorporated into the model identification process.

    • Example: Learning system models using Lagrangian Neural Networks (LNNs) (Ramesh & Ravindran, 2023).

      This comprehensive framework allows the paper to analyze how physics knowledge flows through PIRL methods and integrates into the RL pipeline, enabling a structured review and identification of research gaps.
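Tying item 10 back to the Euler-Lagrange relation in Section 4.2.4.1, the sketch below shows the core computation of Lagrangian-style model identification: given mass matrix, Coriolis, and gravity terms, the forward model returns $\ddot{q} = M^{-1}(q)(\tau - C(q,\dot{q})\dot{q} - G(q))$. The pendulum terms used here are textbook values chosen for illustration; in an LNN (Cranmer et al., 2020; Ramesh & Ravindran, 2023) they would instead be derived from a learned Lagrangian.

```python
import numpy as np

def forward_dynamics(q, q_dot, tau, M_fn, C_fn, G_fn):
    """q_ddot = M(q)^-1 (tau - C(q, q_dot) q_dot - G(q)), for any generalized coordinates."""
    M = M_fn(q)
    rhs = tau - C_fn(q, q_dot) @ q_dot - G_fn(q)
    return np.linalg.solve(M, rhs)

# Illustrative single-pendulum terms (mass m, length l): M = [[m l^2]], C = 0, G = [m g l sin(q)].
m, l, g0 = 1.0, 1.0, 9.81
M_fn = lambda q: np.array([[m * l**2]])
C_fn = lambda q, q_dot: np.zeros((1, 1))
G_fn = lambda q: np.array([m * g0 * l * np.sin(q[0])])

q, q_dot, dt = np.array([0.1]), np.array([0.0]), 0.01
for _ in range(200):                                  # Euler rollout of the physics model
    q_ddot = forward_dynamics(q, q_dot, tau=np.zeros(1), M_fn=M_fn, C_fn=C_fn, G_fn=G_fn)
    q_dot = q_dot + q_ddot * dt
    q = q + q_dot * dt
print(q, q_dot)
```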

5. Experimental Setup

As a survey paper, this work does not present its own experimental setup, datasets, or evaluation metrics in the traditional sense. Instead, it reviews and analyzes these aspects from the PIRL literature it surveys.

5.1. Datasets

The paper does not use specific datasets for its own research, but it reviews the simulation/evaluation benchmarks commonly used in the PIRL literature. These benchmarks serve as the "datasets" or environments where PIRL algorithms are trained and tested.

The review of simulation/evaluation benchmarks (Section 4.2 of the paper) highlights:

  • Standard RL Benchmarks: A majority of works dealing with dynamic control use established RL environments such as OpenAI Gym (e.g., Pusher, Striker, Mountain Car, Cart-Pole, Pendulum), Safe Gym (e.g., Cart pole swing up, Ant), MuJoCo (e.g., Ant, HalfCheetah, Humanoid, Walker2d, Inverted Pendulum), Pybullet (e.g., Franka Panda, Flexiv Rizon), and DeepMind Control Suite (e.g., Reacher, Pendulum, Cartpole).

  • Domain-Specific Benchmarks:

    • Traffic Management: Platforms like SUMO (Simulation of Urban MObility) and CARLA (Car Learning to Act) are utilized for vehicular traffic control applications.
    • Power and Voltage Management: IEEE Distribution System Benchmarks (e.g., IEEE 33-bus, 141-bus, 9-bus systems) are used to evaluate algorithms in electrical grids.
    • Robotics: Specific robotic simulators or platforms like Cassie-MuJoCo-sim, 6 DoF Kinova Jaco, Shadow dexterous hand, Gazebo (for quadrotors) are used.
  • Custom Environments: A crucial observation is that a significant number of PIRL works rely on customized or adapted environments. These are often tailored to specific domain problems (e.g., molecular structure optimization, nuclear engineering, air-traffic control, flow field reconstruction, freeform nanophotonic devices) and may be built using tools like COMSOL, DFT (Density Functional Theory) calculations, MATLAB/SIMULINK, or custom physics engines. This reliance on custom environments makes direct comparison between PIRL algorithms challenging.

    The following are the results from Table 4 of the original paper:

| Simulation/evaluation benchmark (platform and tasks) | Reference |
| --- | --- |
| OpenAI Gym: Pusher, Striker, ErgoReacher | Golemo et al. (2018) |
| OpenAI Gym: Mountain Car, Lunar Lander (continuous) | Jiang et al. (2022) |
| OpenAI Gym: Cart-Pole, Pendulum (simple and double) | Xie et al. (2016) |
| OpenAI Gym: Cart-pole | Cao et al. (2023c) |
| OpenAI Gym: Cart-pole and Quadruped robot | Cao et al. (2023b) |
| OpenAI Gym: CartPole, Pendulum | Liu and Wang (2021) |
| OpenAI Gym: Inverted Pendulum (Pendulum-v0) | Cheng et al. (2019) |
| OpenAI Gym: Mountain car (continuous), Pendulum, Cart pole | Zhao et al. (2022) |
| OpenAI Gym: Simulated car following (He, Jin, & Orosz, 2018) | Mukherjee and Liu (2023) |
| MuJoCo: Ant, HalfCheetah, Humanoid, Walker2d, Humanoid standup, Swimmer, Hopper | |
| MuJoCo: Inverted and Inverted Double Pendulum (v4); Cassie-MuJoCo-sim (robotics) | Duan et al. (2021); Siekmann et al. (2021) |
| 6 DoF Kinova Jaco (Ghosh, Singh, Rajeswaran, Kumar, & Levine, 2017) | Bahl et al. (2021, 2020) |
| MuJoCo: HalfCheetah, Ant, CrippledHalfCheetah, and SlimHumanoid (Zhou et al., 2018) | Lee et al. (2020) |
| MuJoCo: Block stacking task (Janner et al., 2018) | Veerapaneni et al. (2020) |
| OpenAI Gym: CartPole, Pendulum | |
| OpenSim-RL (Kidziski et al., 2018): L2M2019 environment | Korivand et al. (2023) |
| Safety gym: Point, Car and Doggo goal | Yang et al. (2023) |
| Safety gym (Yuan et al., 2021): Cart pole swing up, Ant | Xu et al. (2022) |
| Humanoid, Humanoid MTU | Chen et al. (2023b) |
| DeepMind Control Suite (Tassa et al., 2018): Pendulum, Cartpole, Walker2d, Acrobot, Swimmer, Cheetah; JACO arm (real world); autonomous driving system | Sanchez-Gonzalez et al. (2018) |
| DeepMind Control Suite: Reacher, Pendulum, Cartpole, Cart-2-pole, Acrobot, Cart-3-pole and Acro-3-bot | Ramesh and Ravindran (2023) |
| Rabbit (Chevallereau et al., 2003) | Choi et al. (2020) |
| MARL env. (Lowe et al., 2017): Multi-agent particle env. | Cai et al. (2021) |
| ADROIT (Rajeswaran et al., 2017): Shadow dexterous hand | Garcia-Hernando et al. (2020) |
| First-Person Hand Action Benchmark (Garcia-Hernando, Yuan, Baek, & Kim, 2018) | |
| MuJoCo: Door opening, in-hand manipulation, tool use and object relocation | |
| SUMO (Lopez et al., 2018) | Han et al. (2022) |
| METANET (Kotsialos, Papageorgiou, Pavlis, & Middelham, 2002) | Wang (2022) |
| SUMO | Udatha et al. (2022) |
| CARLA (Dosovitskiy, Ros, Codevilla, Lopez, & Koltun, 2017) | Huang et al. (2023) |
| Gazebo (Koenig & Howard, 2004): Quadrotor (IF750A) | |
| IEEE distribution system benchmarks: IEEE 33-bus and 141-bus distribution networks | Chen et al. (2022) |
| IEEE distribution system benchmarks: IEEE 33-node system | Cao et al. (2023a); Chen et al. (2022) |
| IEEE distribution system benchmarks: IEEE 9-bus standard system | Gao et al. (2022) |
| Custom (COMSOL based) | Alam et al. (2021) |
| Custom (DFT based) | Cho et al. (2019) |
| Custom (based on Vrettos, Kara, MacDonald, Andersson, & Callaway, 2016) | Gokhale et al. (2022) |
| Custom (based on Kesting, Treiber, Schönhof, Kranke, & Helbing, 2007) | Jurj et al. (2021) |
| Custom | Dang & Ishii (2022); Li et al. (2023); Martin & Schaub (2022) |
| Custom | Park et al. (2023); Shi et al. (2023); Yin et al. (2023) |
| Custom | Yousif et al. (2023); Yu et al. (2023); Zhao et al. (2023) |
| Custom | Cohen & Belta (2023); Li & Belta (2019); Lutter et al. (2021) |
| Custom | Chen et al. (2023b); Wang et al. (2023) |
| OpenAI Gym, Custom (reactor geometries) | Radaideh et al. (2021) |
| MATLAB-Simulink, Custom | She et al. (2023); Zhang et al. (2022) |
| | Mora et al. (2021) |
| MATLAB, Custom | Zimmermann, Poranne, Bern, & Coros (2019) |
| Cruise control | Li et al. (2021) |
| Pygame, Custom | Zhao & Liu (2021) |
| Custom: Unicycle, Car-following | Emam, Notomista, Glotfelter, Kira, & Egerstedt (2022) |
| Brushbot, Quadrotor (sim) | Ohnishi et al. (2019) |
| Phantom manipulation platform | Lowrey et al. (2018) |
| Pybullet: 2-finger gripper; gym-pybullet-drones (Panerati et al., 2021) | Du et al. (2023) |
| Pybullet: Franka Panda, Flexiv Rizon | Lv et al. (2022) |
| NimblePhysics (Werling, Omens, Lee, Exarchos, & Liu, 2021) | |

5.2. Evaluation Metrics

As a survey, the paper does not define or use its own evaluation metrics. However, it implicitly discusses how PIRL approaches aim to improve standard RL metrics or introduce new physics-informed ones, such as:

  • Sample Efficiency: PIRL often reduces the number of agent-environment interactions required for learning.
  • Physical Plausibility/Soundness: Ensuring solutions adhere to physical laws, which can be measured qualitatively or through specific physics-based error terms.
  • Safety Guarantees: Metrics related to avoiding unsafe states or respecting constraints (e.g., number of constraint violations).
  • Generalization: Performance on unseen scenarios or different environments, often improved by physics priors.
  • Convergence Speed: How quickly the RL algorithm reaches an optimal or near-optimal policy.
  • Accuracy: For model-based PIRL, the accuracy of the learned physics model.

5.3. Baselines

The paper does not compare its own method against baselines, as it is a review. Instead, it analyzes the RL algorithms and techniques used in the surveyed PIRL literature. These often implicitly serve as baselines within the individual PIRL papers, with PIRL approaches demonstrating improvements over purely data-driven RL counterparts. The most frequently used RL algorithms in PIRL literature are PPO, DDPG, SAC, and TD3.
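For orientation, the snippet below shows how a purely data-driven baseline of the kind surveyed here is typically set up on one of the standard benchmarks listed above. It assumes the gymnasium and stable-baselines3 packages are available and is not taken from any specific PIRL paper.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# A standard continuous-control benchmark frequently used in the surveyed works.
env = gym.make("Pendulum-v1")

# Purely data-driven PPO baseline; PIRL approaches would add physics on top of this loop,
# e.g. by wrapping the environment (state/reward design) or filtering actions (safety filter).
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=20_000)

obs, _ = env.reset(seed=0)
for _ in range(200):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
env.close()
```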

6. Results & Analysis

The paper presents its findings through statistical analysis of the surveyed literature and discussions on the impact of PIRL on RL challenges and future trends.

6.1. Core Results Analysis

6.1.1. Statistical Analysis of PIRL Literature

The paper provides several statistical insights into the PIRL landscape, based on Figure 15 of the original paper.

The following figure (Figure 15 from the original paper) presents the statistical insights and trends in PIRL:

Fig. 7. PIRL taxonomy and further categories. Physics information (types), the PIRL methods that incorporate them, and the underlying RL pipeline constitute the PIRL taxonomy (see Fig. 9); bias and learning architecture form the further categories.

a) Use of RL Algorithms:

  • PPO (Proximal Policy Optimization) and its variants are the most preferred RL algorithms in PIRL literature, indicating a preference for on-policy methods known for stability.
  • DDPG (Deep Deterministic Policy Gradient) is the second most popular.
  • Among newer algorithms, SAC (Soft Actor-Critic) is favored over TD3 (Twin-Delayed DDPG). This suggests a lean towards algorithms that balance sample efficiency, stability, and applicability in continuous action spaces, which are common in physical systems.

b) Types of Physics Priors Used:

  • Physics Simulator and Model (PS) and Barrier Certificate and Physical Constraints (BPC) collectively dominate, used in over 60% of works, especially in Action regulation and Augment policy and value N/W PIRL methods.
  • This highlights the criticality of safe exploration and accurate system dynamics modeling in PIRL.
  • Other significant PI types include Physics Parameters, Primitives and Physical Variables (PPV) and Differential and Algebraic Equations (DAE).

c) Learning Architecture and Bias:

  • In PI reward and safety filter architectures, physics is predominantly incorporated through learning bias. This implies heavy reliance on soft constraints, regularizers, and specialized loss functions to guide the learning process towards physical consistency.
  • Physics embedded network architectures primarily use inductive bias. This signifies the imposition of hard constraints through the design of custom, physics-aware neural network structures (e.g., Hamiltonian or Lagrangian Neural Networks).
  • This distinction clarifies whether physics is "softly" encouraged via penalties or "hardwired" into the model's fundamental structure.

d) Application Domains:

  • Approximately 85% of PIRL applications relate to controller or policy design.
  • Miscellaneous control: Accounts for the majority, including optimal control for energy management, data-center cooling, etc.
  • Safe control and exploration: Represents 25% of applications, focusing on safety-critical systems and ensuring safe learning.
  • Dynamic control: Constitutes 23% of works, involving the control of dynamic systems, including robotic systems.
  • Other applications include optimization/prediction, motion capture/simulation, and improving general policy optimization through physics incorporation. This breakdown clearly shows PIRL's strong inclination towards practical control problems, especially those requiring safety and dynamic precision.

6.1.2. Impact on Learning Efficiency

Different physics priors have varying impacts on learning efficiency:

  • Barrier Certificates and Control Constraints (BPC): Reduce exploration requirements significantly (40-50% sample reduction) by constraining action spaces to safe regions, while ensuring stability (Cheng et al., 2019; Zhao et al., 2023).
  • Differential and Algebraic Equations (DAE): Show strong sample efficiency. PINN-HJB (Hamilton-Jacobi-Bellman) implementations (Zhang et al., 2022) can reduce required samples by up to 70% compared to standard RL.
  • Physics Simulators (PS): Provide moderate initial efficiency improvements but excel at reducing real-world samples (>85%) during deployment, particularly in sim-to-real transfer (Golemo et al., 2018; Lowrey et al., 2018).
  • Physics Parameters and Variables (PPV): Offer 20-30% faster convergence and substantial robustness improvements (Jurj et al., 2021; Shi et al., 2023).
  • Combined Methods: Approaches integrating multiple physics incorporation methods (e.g., Cao et al. (2023b), Li & Belta (2019)) demonstrate superior efficiency gains across all metrics, achieving up to 65% sample reduction and enhanced generalization. This emphasizes that multi-faceted integration often yields the best results.
6.1.3. Research and Emerging Trends

  • Research Trends (2018-2023): Exponential growth (Figure 1), a shift towards safety-critical applications using CBFs (Cheng et al., 2019; Li et al., 2021; Zhao et al., 2023), increased adoption of PPO, and growing interest in dynamic control problems. A notable trend is the combination of multiple physics incorporation methods rather than single approaches.
  • Emerging Trends (2024): PIRL is expanding into:
    • Autonomous Systems and Robotics: Control optimization for tracking moving objects (Faria et al., 2024), advanced locomotion (Ogum et al., 2024), swimming in turbulent environments (Koh et al., 2025).
    • Energy Systems: Robust voltage control (Wei et al., 2023), physics-guided multi-agent frameworks (Chen et al., 2023a), probabilistic wind power prediction (Chen et al., 2024).
    • Healthcare Applications: Optimizing arm movements for rehabilitation (Liu et al., 2024), hand-object interaction controllers (Wannawas et al., 2024).
    • Transportation Safety: Real-time optimal traffic routing (Ke et al., 2025), safe multi-agent collision avoidance (Feng et al., 2024).
    • Advanced Modeling: Hybrid planning models (Asri et al., 2024), physics-informed deep transfer RL (Zeng et al., 2024).
    • Safety Guarantees: Physics-model-guided worst-case sampling (Cao et al., 2024) and domain knowledge integration for enhanced efficiency with safety.
    • Multi-agent Collaborative Systems: Distributed energy management (Chen et al., 2024).

6.1.4. RL Challenges Addressed by PIRL

PIRL effectively addresses several open problems in RL:

  • Sample Efficiency: By augmenting simulators (sim-to-real gap reduction) or learning more accurate world models in MBRL with physics, PIRL reduces the need for extensive real-world interaction (Alam et al., 2021; Sanchez-Gonzalez et al., 2018).
  • Curse of Dimensionality: PIRL extracts physically relevant low-dimensional representations from high-dimensional state/observation spaces (Cao et al., 2023a; Gokhale et al., 2022).
  • Safe Exploration: Utilizes control theoretic concepts like CLF and CBF to guarantee safe exploration and policy deployment, restricting agents to safe state/action sets (Cai et al., 2021; Cheng et al., 2019).
  • Partial Observability/Imperfect Measurement: Modifies or enhances state representations with additional physics or geographical information to compensate for missing or inadequate data (Jurj et al., 2021; Shi et al., 2023).
  • Under-defined Reward Function: Introduces physics information into reward design or reward shaping, augmenting existing functions with bonuses or penalties that naturally align with physical objectives (Dang & Ishii, 2022; Siekmann et al., 2021).
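
As an illustration of the reward-design route in the last item above, the following sketch augments a base pendulum swing-up reward with an energy-shaping penalty; the pendulum constants, the weight `w_phys`, and the function names are hypothetical, and the snippet is a generic example rather than a reproduction of any surveyed method.

```python
import numpy as np

# Illustrative physics-guided reward shaping (generic example, not from any surveyed paper).
# The base task reward is augmented with a shaping term derived from a physical quantity:
# here, how close the pendulum's mechanical energy is to the energy of the upright position.

def pendulum_energy(theta, theta_dot, m=1.0, l=1.0, g=9.81):
    """Total mechanical energy of a simple pendulum, zero at the hanging rest position."""
    kinetic = 0.5 * m * (l * theta_dot) ** 2
    potential = m * g * l * (1.0 - np.cos(theta))
    return kinetic + potential

def shaped_reward(theta, theta_dot, base_reward, m=1.0, l=1.0, g=9.81, w_phys=0.1):
    """Task reward plus a physics-informed penalty for missing the swing-up energy target."""
    target_energy = 2.0 * m * g * l                    # energy of the upright equilibrium
    energy_error = pendulum_energy(theta, theta_dot, m, l, g) - target_energy
    return base_reward - w_phys * energy_error ** 2    # physically aligned bonus/penalty

# Usage inside an environment step (all values are hypothetical):
r = shaped_reward(theta=0.3, theta_dot=1.2, base_reward=-0.05)
print(r)
```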

6.2. Data Presentation (Tables)

The following are the results from Table 2 of the original paper:

| Ref. | Year | Context/Application | RL Algorithm | Learning arch. | Bias | Physics information | PIRL methods | RL pipeline |
|---|---|---|---|---|---|---|---|---|
| Chentanez et al. (2018) | 2018 | Motion capture | PPO | Physics reward | Learning | Physics simulator | Reward design | Problem representation |
| Peng, Abbeel, Levine, and Van de Panne (2018) | 2018 | Motion control | PPO (Schulman, Wolski, Dhariwal, Radford, & Klimov, 2017) | Physics reward | Learning | Physics simulator | Reward design | Problem representation |
| Golemo et al. (2018) | 2018 | Policy optimization | PPO | Sim-to-Real | Observational | Offline data | Augment simulator | Training |
| Lowrey et al. (2018) | 2018 | Policy optimization | NPG (Williams, 1992) (C) | Sim-to-Real | Observational | Offline data | Augment simulator | Training |
| Cho et al. (2019) | 2019 | Molecular structure optimization | DDPG | Physics reward | Learning | DFT (PS) | Reward design | Problem representation |
| Li and Belta (2019) | 2019 | Safe exploration | PPO | Residual RL | Learning | CBF, CLF, FSA/TL (BPC) | Augment simulator, Reward design | Training, Problem representation |
| Bahl et al. (2020) | 2020 | Dynamic system control | PPO | Phy. embed. N/W | Inductive | DMP (PPV) | Augment policy | Learning strategy, Network design |
| Garcia-Hernando et al. (2020) | 2020 | Dexterous manipulations | PPO | Residual RL | Observational | Physics simulator | Reward design | Problem representation |
| Luo et al. (2020) | 2020 | 3D ego-pose estimation | PPO | Physics reward | Learning | Physics simulator | State design, Reward design | Problem representation |
| Bahl et al. (2021) | 2021 | Dynamic system control | PPO | Hierarchical RL | Inductive | DMP (PPV) | Augment policy | Network design |
| Margolis et al. (2021) | 2021 | Dynamic system control | PPO | Hierarchical RL | Learning | WBIC (PPV) | Augment policy | Learning strategy |
| Alam et al. (2021) | 2021 | Manufacturing | SARSA (Sutton & Barto, 1998) | Sim-to-Real | Observational | Physics engine | Augment simulator | Training |
| Siekmann et al. (2021) | 2021 | Dynamic system control | PPO | Phy. variable | Learning | Physics parameters | Reward design | Problem representation |
| Li et al. (2021) | 2021 | Safe exploration and control | NFQ (Riedmiller, 2005) | Safety filter | Learning | Physical constraint | Action regulation | Problem representation |
| Jurj et al. (2021) | 2021 | Safe cruise control | SAC | Phy. variable | Observational | Physical state (PPV) | State design | Problem representation |
| Mora et al. (2021) | 2021 | Policy optimization | DPG (C) | Diff. simulator | Learning | Physics simulator | Augment policy | Learning strategy |
| Radaideh et al. (2021) | 2021 | Optimization, nuclear engineering | DQN, PPO | Physics reward | Learning | Physical properties (PPR) | Reward design | Problem representation |
| Zhao and Liu (2021) | 2021 | Air-traffic control | PPO | Data augmentation | Observational | Representation (ODR) | State design | Problem representation |
| Wang (2022) | 2022 | Motion planner | PPO + AC (Konda & Tsitsiklis, 1999) | Safety filter | Learning | CBF (BPC) | Action regulation | Problem representation |
| Chen et al. (2022) | 2022 | Active voltage control | TD3 (C) | Safety filter | Learning | Physical constraints | Reward design (penalty function), Action regulation | Problem representation |
| Dang and Ishii (2022) | 2022 | Structure prediction | — | — | — | — | — | — |
| Gao et al. (2022) | 2022 | Transient voltage control | — | — | — | — | — | — |
| Gokhale et al. (2022) | 2022 | Building control | DQN | PINN loss | Learning | PDE (DAE) | State design, Augment policy | Learning strategy, Problem representation |
| Han et al. (2022) | 2022 | Traffic control | Q-learning (C) | Data augment | Observational | Representation (ODR) | State design | Problem representation |
| Martin and Schaub (2022) | 2022 | — | Q-Learning | Data augment | Observational | Physics model | State design | Training |
| Jiang, Fu, and Chen (2022) | 2022 | Safe exploration and control | SAC | Sim-to-Real | Observational | Physics model | Augment simulator | Problem representation |
| — | 2022 | Dynamic system control | SAC (etc.) | Physics reward | Learning | Barrier function | Reward design | Learning strategy |
| Xu et al. (2022) | 2022 | Policy learning | Actor-critic (C) | Diff. simulator | Learning | Physics simulator | Augment policy | Learning strategy |
| Cao et al. (2023c) | 2023 | Safe exploration and control | DDPG | Residual RL | Learning | Physics model | Reward design | Problem representation |

The following are the results from Table 2 (continued) of the original paper:

| Ref. | Year | Context/Application | RL Algorithm | Learning arch. | Bias | Physics information | PIRL methods | RL pipeline |
|---|---|---|---|---|---|---|---|---|
| Cao et al. (2023b) | 2023 | Safe exploration and control | DDPG | Residual RL | Inductive | Physics model | Reward design, Action regulation | Problem representation |
| Cao et al. (2023a) | 2023 | — | — | — | Inductive | — | N/W editing (Augment policy) | Network design |
| — | 2023 | Robust voltage control | SAC | Data augment | Observational | Representation (ODR) | State design | Problem representation |
| Chen, Liu, and Di (2023b) | 2023 | Mean field games | DDPG | Physics reward | Learning | Physics model | Reward design | Problem representation |
| Yang et al. (2023) | 2023 | Safe exploration and control | PPO (C) | Safety filter | Learning | NBC (BPC) | Augment policy | Training |
| Zhao et al. (2023) | 2023 | Power system stability enhancement | Custom | Safety filter | Learning | NBC (BPC) | Action regulation | Problem representation |
| Du et al. (2023) | 2023 | Safe exploration and control | AC (C) | Safety filter | Learning | CLBF (Dawson, Qin, Gao, & Fan, 2022; Romdlony & Jayawardhana, 2016) (BPC) | Augment value N/W | Training |
| Shi et al. (2023) | 2023 | Connected automated vehicles | DPPO | Physics variable | Observational | Physical state (PPV) | State design | Problem representation |
| Korivand et al. (2023) | 2023 | Musculoskeletal simulation | SAC (C) | Physics variable | Learning | Physical value | Reward design | Problem representation |
| Li et al. (2023) | 2023 | Energy management | MADRL (C) | Physics variable | Learning | Physical target | Reward design | Problem representation |
| Mukherjee and Liu (2023) | 2023 | Policy optimization | PPO | Phy. embed. N/W | Inductive | PDE (DAE) | Augment value N/W | Network design |
| Yousif et al. (2023) | 2023 | Flow field reconstruction | A3C | Physics reward | Learning | Physical constraints | Reward design | Problem representation |
| Park et al. (2023) | 2023 | Freeform nanophotonic devices | greedy Q | Phy. embed. N/W | Inductive | ABM | Augment value N/W | Network design |
| Rodwell and Tallapragada (2023) | 2023 | Dynamic system control | DPG | Curriculum learning | Learning | Physics model | Augment simulator | Training |
| She et al. (2023) | 2023 | Energy management | TD3 | Sim-to-Real | — | Physics model | Augment simulator | Learning strategy |
| Yin et al. (2023) | 2023 | Robot wireless navigation | PPO | Physics reward | Learning | Physical value | Reward design | Problem representation |

The following are the results from Table 3 of the original paper:

| Ref. | Year | Context/Application | Algorithm | Learning arch. | Bias | Physics information | PIRL method | RL pipeline |
|---|---|---|---|---|---|---|---|---|
| Xie et al. (2016) | 2016 | Exploration and control | — | Model learning | Observational | Sys. morphology (PPR) | Augment model | Learning strategy |
| Sanchez-Gonzalez et al. (2018) | 2018 | Dynamic system control | — | Model learning | Inductive | Physics model | Augment model | Learning strategy |
| Ohnishi et al. (2019) | 2019 | Safe navigation | — | Safety filter | Learning | CBC (BPC) | Action regulation | Problem representation |
| Cheng et al. (2019) | 2019 | Safe exploration and control | TRPO, DDPG | Residual RL | Learning | CBF (BPC) | Action regulation | Problem representation |
| Veerapaneni et al. (2020) | 2020 | Control (visual RL) | — | Model learning | Observational | Entity abstraction (ODR) | Augment model | Learning strategy |
| Lee et al. (2020) | 2020 | Dynamic system control | — | Model learning | Observational | Context encoding (ODR) | Augment model | Learning strategy |
| Choi et al. (2020) | 2020 | Safe exploration and control | DDPG (Silver et al., 2014) | Safety filter | Learning | CBF, CLF, QP (BPC) | Augment policy | Learning strategy |
| Liu and Wang (2021) | 2021 | Dynamic system control | Dyna + TD3 (C) | Model identification | Learning | PDE/ODE, BC (DAE) | Augment model | Learning strategy |
| Duan et al. (2021) | 2021 | Dynamic system control | PPO | Residual RL | Learning | Physics model | Action regulation | Problem representation |
| Cai et al. (2021) | 2021 | Multi-agent collision avoidance | MADDPG (C) | Safety filter | Learning | CBF (BPC) | Action regulation | Problem representation |
| Lv et al. (2022) | 2022 | Dynamic system control | TD3 (C) | Sim-to-Real | Learning | Physics simulator | Augment policy | Learning strategy |
| Udatha, Lyu, and Dolan (2022) | 2022 | Traffic control | AC (Ma et al., 2021) | Safety filter | Learning | CBF (BPC) | Augment model | Learning strategy |
| Zhao et al. (2022) | 2022 | Safe exploration and control | DDPG | Safety filter | Learning | CBC (BPC) | Augment policy | Learning strategy |
| Zhang et al. (2022) | 2022 | Distributed MPC | AC (Jiang, Fan, Gao, Chai, & Lewis, 2020) | Safety filter | Learning | CBF (BPC) | State design | Problem representation |
| Ramesh and Ravindran (2023) | 2023 | Dynamic system control | Dreamer (Hafner, Lillicrap, Ba, & Norouzi, 2019) | Phy. embed. N/W | Inductive | Physics model | Augment model | Network design |
| Cohen and Belta (2023) | 2023 | Safe exploration and control | — | Safety filter | Learning | CBF (BPC) | Augment model | Learning strategy |
| Huang et al. (2023) | 2023 | Attitude control | — | Phy. embed. N/W | Inductive | System symmetry (PPR) | Augment model | Network design |
| Wang, Cao, Zhou, Wen, and Tan (2023) | 2023 | Data center cooling | SAC | Model identification | Learning | Physics laws (PPR) | Augment model | Learning strategy |
| Yu, Zhang, and Song (2023) | 2023 | Cooling system control | DDPG | Residual RL | Learning | CBF (BPC) | Action regulation | Problem representation |

6.3. Ablation Studies / Parameter Analysis

The paper, being a survey, does not conduct its own ablation studies or parameter analysis. Instead, it aggregates insights from the surveyed literature regarding the impact of physics incorporation. The findings on sample efficiency (Section 4.3.2) serve as a high-level form of "ablation analysis" by comparing the efficiency gains achieved when different types of physics priors are integrated into RL systems. For instance, the observation that combining multiple physics incorporation methods yields superior efficiency gains suggests that each component contributes positively, hinting at the results of implicit ablation studies performed within individual papers.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper provides a comprehensive and timely survey of Physics-Informed Reinforcement Learning (PIRL), a rapidly growing field that bridges data-driven Deep Reinforcement Learning (DRL) with fundamental physical principles. The core contribution is a novel, unified taxonomy that systematically categorizes PIRL approaches based on the type of physics information used, the specific PIRL methods employed to augment RL components, and their integration points within the conventional RL pipeline. Additionally, it introduces categorizations by learning architecture and physics incorporation bias to offer deeper insights into implementation strategies.

The review highlights that PIRL significantly enhances RL algorithms by improving physical plausibility, precision, data efficiency, and real-world applicability. It addresses critical RL challenges such as sample efficiency (via augmented simulators and models), curse of dimensionality (through physics-informed feature extraction), safe exploration (using control theoretic tools like CBFs), partial observability (by enriching state representations), and under-defined reward functions (through physics-guided reward design). The statistical analysis reveals dominant RL algorithms (PPO, DDPG), prevalent physics priors (simulators, barrier certificates), and a strong focus on control and safety-critical applications. The paper also identifies clear research trends, including a move towards combining multiple physics incorporation methods and an expansion into diverse emerging domains like autonomous systems, energy, healthcare, and transportation safety.

7.2. Limitations & Future Work

The paper identifies several crucial limitations and open challenges, outlining promising directions for future research in PIRL:

  1. High-Dimensional Spaces:

    • Limitation: Learning compressed, informative latent spaces from high-dimensional continuous states (or actions) while ensuring physical relevance remains challenging.
    • Future Work: Leverage Physics-Informed Autoencoders (embedding PDE constraints into the loss function; a toy loss sketch follows this list) and enforce physics-aware structures during training. Utilize Latent Diffusion Models to generate structured representations constrained by physical laws.
  2. Safety in Complex and Uncertain Environments:

    • Limitation: Current physics-informed approaches using control theoretic concepts (e.g., CBFs) are limited by approximated system models and prior knowledge about safe state sets. Generalization across different tasks and environments is poor.
    • Future Work: Focus on model-agnostic safe exploration and control and developing generalized methods for integrating physics into data-driven model learning. This includes formulating safety and performance as a state-constrained optimal control problem using Hamilton-Jacobi-Bellman (HJB) equations and employing conformal prediction-based verification for generalization across complex, high-dimensional environments.
  3. Choice of Physics Prior:

    • Limitation: The optimal choice of physics prior is critical yet challenging, requiring extensive study and varying significantly across systems, hindering PIRL efficiency. Current approaches often tackle tasks individually.
    • Future Work: Develop a more comprehensive framework for managing novel physical tasks, potentially through Physics-Guided Foundation Models (PGFMs) that integrate broad-domain physical knowledge to enhance robustness, generalization, and prediction reliability across diverse systems.
  4. Evaluation and Benchmarking Platform:

    • Limitation: A lack of comprehensive benchmarking and evaluation environments makes it difficult to test and compare new PIRL approaches fairly. Most works rely on customized, domain-specific environments.
    • Future Work: The field needs standardized, diverse benchmarks that allow for fair comparisons and assessment of quality and uniqueness of new PIRL methods.
  5. Scalability for Large-Scale Problems:

    • Limitation: PIRL approaches face computational bottlenecks when physics-informed components (e.g., barrier certificates, differentiable simulators) are applied to high-dimensional state spaces. Maintaining accurate physics models becomes difficult, and approximation errors can cascade.
    • Future Work: Explore hierarchical decomposition approaches, reduced-order modeling techniques, physics-guided representation learning for dimensionality reduction, and distributed PIRL architectures for multi-agent systems. Recent advancements like Hyper-Low-Rank PINN (Torres et al., 2025) and Enhanced Hybrid Adaptive PINN (Luo et al., 2025) with dynamic collocation point allocation, and Difficulty-Aware Task Sampler (DATS) (Toloubidokhti et al., 2023) are promising directions.
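
As a toy illustration of the physics-informed autoencoder idea from item 1 above, the sketch below scores a linear encoder/decoder pair with a reconstruction term plus a physics-residual term. The conservation-style constraint, the weight `lam`, and all names are hypothetical stand-ins for the PDE constraints mentioned above; a real implementation would minimize this loss with gradient-based training.

```python
import numpy as np

# Illustrative sketch of a physics-informed autoencoder loss (hypothetical setup, not from
# the survey): a linear encoder/decoder pair is scored by reconstruction error plus a
# physics-residual term -- here, violation of a known conservation law (the components of
# each state vector must sum to a conserved quantity, e.g. total mass or total energy).

def pi_autoencoder_loss(W_enc, W_dec, X, lam=1.0):
    """Reconstruction loss + physics penalty for a batch X of shape (n_samples, n_features)."""
    Z = X @ W_enc                 # encode to a low-dimensional latent space
    X_hat = Z @ W_dec             # decode back to the full state
    recon = np.mean((X - X_hat) ** 2)
    conserved = X.sum(axis=1)     # quantity that the physics says must be preserved
    physics = np.mean((X_hat.sum(axis=1) - conserved) ** 2)
    return recon + lam * physics  # lam trades data fit against physical consistency

# Usage with random weights (training would minimize this loss over W_enc and W_dec):
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))
W_enc = rng.normal(size=(10, 3)) * 0.1
W_dec = rng.normal(size=(3, 10)) * 0.1
print(pi_autoencoder_loss(W_enc, W_dec, X))
```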

7.3. Personal Insights & Critique

This survey paper by Banerjee et al. is an excellent and timely contribution to the Physics-Informed Reinforcement Learning (PIRL) field. Its greatest strength lies in the meticulously crafted taxonomy, which provides a much-needed structural lens to understand a rapidly diversifying research area. Before this paper, the landscape of PIRL could appear fragmented, with various approaches integrating physics in different ways without a unifying framework. The proposed categorization by physics information type, PIRL method, and RL pipeline stage is incredibly insightful and will undoubtedly serve as a foundational reference for future research. The additional bias and learning architecture categorizations further enrich this framework, offering practical guidance for design choices.

One key inspiration drawn from this paper is the emphasis on multi-faceted physics integration. The finding that approaches combining multiple physics incorporation methods (e.g., physics-guided rewards, action regulation, and network editing) often yield superior results suggests that a holistic, layered approach to infusing domain knowledge is more powerful than relying on a single mechanism. This insight can be transferred to other ML domains where domain expertise is available but often underutilized, encouraging a more comprehensive integration strategy. For instance, in scientific machine learning applications beyond RL, one could explore combining physics-informed loss functions with physics-aware network architectures and data augmentation techniques that respect physical symmetries.

A potential area for improvement or an unverified assumption highlighted implicitly by the paper is the generalizability of physics priors. While PIRL aims for better generalization, the reliance on customized environments and the challenge of choosing the "right" physics prior for novel tasks suggest that current PIRL methods might still suffer from a form of physics overfitting. That is, the chosen physics might be highly effective for a specific system or task but not easily transferable. The future direction of Physics-Guided Foundation Models is a promising attempt to address this, but the inherent complexity of diverse physical systems means that a truly "universal" physics prior might remain elusive. Further research could explore adaptive physics prior selection or meta-learning approaches that learn which type of physics information is most relevant for a given problem class.

Another point of critique, though acknowledged by the authors, is the lack of standardized benchmarks. While the paper reviews existing benchmarks, the prevalence of "custom environments" in Tables 2, 3, and 4 underscores a significant hurdle for comparing and validating PIRL methods. This makes it difficult to ascertain whether observed performance gains are due to novel PIRL techniques or simply idiosyncratic properties of a specific custom setup. The PIRL community urgently needs a suite of diverse, standardized, and open-source benchmarks with varying levels of physical complexity and uncertainty to facilitate rigorous comparisons and accelerate progress.

Overall, this paper is a landmark survey that not only consolidates the current state of PIRL but also thoughtfully charts a course for its future, making it an invaluable resource for researchers in AI, robotics, and computational science.
