A survey on physics informed reinforcement learning: Review and open problems
TL;DR Summary
This paper reviews the emerging field of Physics-Informed Reinforcement Learning (PIRL), introducing a novel taxonomy based on the RL pipeline to better understand current methods and identify key challenges, and highlighting the potential of enhancing RL algorithms' applicability and physical plausibility in real-world scenarios.
Abstract
The fusion of physical information in machine learning frameworks has revolutionized many application areas. This work explores their utility for reinforcement learning applications. A thorough review of the literature on the fusion of physics information in reinforcement learning approaches, commonly referred to as physics-informed reinforcement learning (PIRL), is presented. A novel taxonomy is introduced with the reinforcement learning pipeline as the backbone to classify existing works, compare and contrast them, and derive crucial insights. Existing works are analyzed with regard to the representation/form of the governing physics modeled for integration, their specific contribution to the typical reinforcement learning architecture, and their connection to the underlying reinforcement learning pipeline stages. Core learning architectures and physics incorporation biases of existing PIRL approaches are identified and used to further categorize the works for better understanding and adaptation. By providing a comprehensive perspective on the implementation of the physics-informed capability, the taxonomy presents a cohesive approach to PIRL. It identifies the areas where this approach has been applied, as well as the gaps and opportunities that exist. Additionally, the review highlights unresolved issues and challenges, while also incorporating potential and emerging solutions to guide future research. This nascent field holds great potential for enhancing reinforcement learning algorithms by increasing their physical plausibility, precision, data efficiency, and applicability in real-world scenarios.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
A survey on physics informed reinforcement learning: Review and open problems
1.2. Authors
- Chayan Banerjee (Queensland University of Technology, Brisbane, Australia)
- Kien Nguyen (Queensland University of Technology, Brisbane, Australia)
- Clinton Fookes (Queensland University of Technology, Brisbane, Australia)
- Maziar Raissi (University of Colorado, Boulder, USA)
1.3. Journal/Conference
This paper is a review article. The provided metadata indicates a 2023 publication date. Given its scope as a comprehensive survey, it would typically appear in a reputable journal focusing on machine learning, artificial intelligence, or control systems. While the specific journal name is not provided, the authors' affiliations and the depth of the review suggest a high-impact venue. Maziar Raissi is notably associated with Physics-Informed Neural Networks (PINNs), a foundational concept in the field, lending significant credibility to the work.
1.4. Publication Year
2023
1.5. Abstract
This work surveys the emerging field of Physics-Informed Reinforcement Learning (PIRL), which integrates physical information into Reinforcement Learning (RL) frameworks. The paper introduces a novel taxonomy, structured around the RL pipeline, to categorize existing PIRL approaches based on the form of physics modeled, its contribution to RL architectures, and its connection to RL pipeline stages. It identifies core learning architectures and physics incorporation biases, providing a cohesive understanding of PIRL implementations. The review highlights current applications, research gaps, challenges, and potential solutions, emphasizing PIRL's potential to enhance RL algorithms in terms of physical plausibility, precision, data efficiency, and real-world applicability.
1.6. Original Source Link
/files/papers/6919fecd110b75dcc59ae34d/paper.pdf (Published status: Likely an officially published paper, given the formal structure and affiliations, though a direct journal link is not provided in the available metadata.)
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is the inherent limitations of conventional, purely data-driven Reinforcement Learning (RL) algorithms when applied to complex real-world systems. While RL has achieved impressive results in simulations, it often struggles with real-world data due to several challenges:
- Sample Efficiency: RL typically requires a vast amount of interaction data, which is often expensive or unsafe to collect in real-world environments.
- High-Dimensionality: Real-world state and action spaces are frequently continuous and high-dimensional, making learning difficult.
- Safe Exploration: Unconstrained trial-and-error exploration can lead to dangerous or physically implausible states in safety-critical applications.
- Disconnection between Simulation and Reality (Sim-to-Real gap): Models trained in simulation often perform poorly when deployed in the physical world due to discrepancies.
- Under-defined Reward Functions: Designing effective reward functions that accurately guide RL agents in complex tasks is challenging.

The importance of this problem stems from the increasing demand for RL solutions in domains like autonomous driving, robotics, and energy systems, where physical plausibility, safety, and efficiency are paramount.
The paper's entry point is the recognition that incorporating mathematical physics into machine learning models, a paradigm known as Physics-Informed Machine Learning (PIML), has revolutionized many application areas. PIML allows neural networks to learn more efficiently from incomplete data by embedding physical laws, leading to faster training, better generalization, and physically sound solutions. Given that most RL-based solutions deal with real-world problems with inherent physical structures, RL is a natural candidate for PIML. The innovative idea is to comprehensively review and systematize the fusion of physical information into RL, termed Physics-Informed Reinforcement Learning (PIRL).
2.2. Main Contributions / Findings
The paper makes several primary contributions to the field of PIRL:
- Unified Taxonomy: It proposes a novel and comprehensive taxonomy for PIRL, structured around the Reinforcement Learning (RL) pipeline. This taxonomy classifies works based on:
  - The type of physics information modeled (e.g., differential equations, barrier certificates, physics parameters, simulators).
  - The PIRL methods used for augmentation (e.g., state design, action regulation, reward design, augmenting policy/value networks, augmenting simulators).
  - The stages of the RL pipeline where physics is integrated (e.g., problem representation, learning strategy, network design, training, deployment).
- Additional Categorizations: It introduces two additional categories, Learning Architecture and Bias (e.g., observational, learning, inductive), to provide a more precise understanding of how physics is implemented in PIRL approaches.
- Algorithmic Review: It provides a state-of-the-art review of PIRL methods, using unified notations and functional diagrams to explain the latest literature across diverse domains.
- Training and Evaluation Benchmark Review: It analyzes the evaluation benchmarks used in the PIRL literature, identifying popular platforms and trends for easy reference.
- In-depth Analysis: It delves into how physics information is integrated into specific RL approaches, the types of physical processes modeled, and the network architectures utilized.
- Identification of Open Problems: It summarizes challenges, unresolved issues, and future research directions (e.g., high-dimensional spaces, safety in complex environments, choice of physics prior, benchmarking, scalability).

The key conclusions and findings are that PIRL holds immense potential to address major RL challenges by:

- Increasing Physical Plausibility: Ensuring learned policies adhere to fundamental physical laws.
- Improving Precision and Generalization: Leading to more accurate and robust solutions that transfer better to unseen scenarios.
- Enhancing Data Efficiency: Reducing the reliance on vast amounts of real-world data by leveraging physics knowledge.
- Expanding Real-World Applicability: Making RL more viable for safety-critical and complex dynamic systems.

These findings collectively provide a structured framework for understanding, developing, and evaluating PIRL approaches, solving the problem of a fragmented and unorganized research landscape in this nascent field.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a foundational understanding of Reinforcement Learning (RL) and Physics-Informed Machine Learning (PIML) is essential.
3.1.1. Reinforcement Learning (RL)
Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make decisions by interacting with an environment to achieve a goal. The learning process is driven by rewards received from the environment.
- Agent: The learner or decision-maker.
- Environment: Everything outside the agent, with which it interacts.
- State ($s$): A representation of the current situation of the environment observed by the agent.
- Action ($a$): A decision made by the agent that affects the environment.
- Reward ($r$): A scalar feedback signal from the environment, indicating the desirability of the agent's action in a given state. The agent's goal is to maximize the cumulative reward over time.
- Policy ($\pi$): A strategy that maps observed states to actions. An optimal policy maximizes the expected cumulative reward.
- Value Function: A prediction of the future reward an agent can expect from a given state (or state-action pair).

The interaction typically follows a Markov Decision Process (MDP) framework.
3.1.2. Markov Decision Process (MDP)
A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making problems, which forms the backbone of most RL problems. It is formally defined by a tuple $(S, A, R, P, \gamma)$:

- $S$: A set of possible states of the environment.
- $A$: A set of possible actions the agent can take.
- $R$: The reward function, which defines the scalar reward received after taking action $a_t$ from state $s_t$ and transitioning to state $s_{t+1}$.
- $P$: The transition probability function (or environment model), which describes the probability of transitioning to state $s_{t+1}$ from state $s_t$ after taking action $a_t$. The "Markov" property means that the future state depends only on the current state and action, not on the entire history of states and actions.
- $\gamma$: The discount factor, which determines the present value of future rewards. A higher $\gamma$ values future rewards more heavily.

The agent's objective in an MDP is to find an optimal policy that maximizes the expected cumulative discounted reward over a trajectory $\tau$. The probability of a trajectory under a policy is given by:
$ p_{\phi}(\tau) = p(s_1) \prod_{t=1}^{T} \pi_{\phi}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) $
where $\phi$ represents the policy parameters.
The objective function to maximize is:
$ \phi^{*} = \arg\max_{\phi} \underbrace{\mathbf{E}_{\tau \sim p_{\phi}(\tau)} \Big[ \sum_{t=1}^{T} \gamma^{t} R(a_t, s_{t+1}) \Big]}_{\mathcal{I}(\phi)} $
where $\mathcal{I}(\phi)$ is the expected return, and $\gamma$ is the discount factor.
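The objective $\mathcal{I}(\phi)$ can be read operationally as a Monte Carlo average of discounted returns over trajectories sampled with the current policy. The hedged sketch below illustrates that reading only; `policy`, `env_step`, and `env_reset` are assumed stand-ins for an agent and environment, not part of the surveyed paper.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * r_t for one trajectory (the inner term of I(phi))."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def estimate_objective(policy, env_step, env_reset, n_episodes=10, horizon=200, gamma=0.99):
    """Monte Carlo estimate of E_tau[ sum_t gamma^t R ] under the current policy."""
    returns = []
    for _ in range(n_episodes):
        s, rewards = env_reset(), []
        for _ in range(horizon):
            a = policy(s)                # sample a_t ~ pi_phi(. | s_t)
            s, r, done = env_step(s, a)  # environment transition and reward
            rewards.append(r)
            if done:
                break
        returns.append(discounted_return(rewards, gamma))
    return float(np.mean(returns))
```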
3.1.3. Model-Free Reinforcement Learning (MFRL)
In Model-Free Reinforcement Learning (MFRL), the agent does not explicitly learn or have access to the environment model . Instead, it learns directly from interactions (experiences) with the environment. Examples include Q-learning, SARSA, Policy Gradient methods (e.g., REINFORCE, Actor-Critic), Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC). These methods are typically simpler to implement but often require a large number of samples for effective learning.
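For concreteness, the snippet below sketches the classic tabular Q-learning update mentioned above; the state/action sizes, learning rate, and exploration settings are arbitrary placeholders.

```python
import numpy as np

# Minimal tabular Q-learning update (a model-free method listed above).
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def select_action(s, rng=np.random.default_rng(0)):
    """Epsilon-greedy exploration over the current Q estimates."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next, done):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])
```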
3.1.4. Model-Based Reinforcement Learning (MBRL)
In Model-Based Reinforcement Learning (MBRL), the agent attempts to learn or is provided with an explicit representation of the environment model . This model can then be used for planning, simulating future outcomes, and generating synthetic data, which can significantly improve sample efficiency. However, learning an accurate model can be challenging, and any inaccuracies in the learned model can lead to suboptimal policies or poor performance in the real environment. Examples include Dyna architectures and Model Predictive Control (MPC) with learned models.
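As a hedged illustration of how an explicit model enables planning, the sketch below performs simple random-shooting MPC with a one-step dynamics model; `model` and `reward_fn` are hypothetical callables standing in for a learned or physics-based model.

```python
import numpy as np

def plan_with_model(model, reward_fn, s0, horizon=15, n_candidates=256, action_dim=1, rng=None):
    """Random-shooting planning: roll candidate action sequences through a (learned)
    dynamics model and return the first action of the best sequence (MPC-style)."""
    rng = rng or np.random.default_rng(0)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    best_ret, best_a0 = -np.inf, None
    for seq in candidates:
        s, ret = np.array(s0, dtype=float), 0.0
        for a in seq:
            s = model(s, a)            # predicted next state from the model
            ret += reward_fn(s, a)     # reward evaluated on the imagined rollout
        if ret > best_ret:
            best_ret, best_a0 = ret, seq[0]
    return best_a0
```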
3.1.5. Physics-Informed Machine Learning (PIML)
Physics-Informed Machine Learning (PIML) refers to a class of machine learning methods, often involving neural networks, that integrate scientific knowledge (e.g., physical laws, differential equations, conservation principles) directly into the learning process. This integration can happen through various mechanisms, such as modifying the loss function with physics-based penalties, designing specialized network architectures that inherently respect physical laws, or using physics models to generate or augment training data. The benefits of PIML include:
- Physical Consistency: Solutions adhere to fundamental physical laws.
- Data Efficiency: Less data is needed for training as physics provides a strong prior.
- Improved Generalization: Models generalize better to unseen scenarios or regions with sparse data.
- Enhanced Interpretability: The physical basis can make models more transparent.
- Faster Training: Physics constraints can guide the optimization process more effectively.
A prominent example is Physics-Informed Neural Networks (PINNs), which embed Partial Differential Equations (PDEs) into the neural network's loss function.
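To make the PINN idea concrete, the hedged sketch below adds a differential-equation residual to an ordinary regression loss. The network architecture, the decay constant `k`, and the toy ODE du/dt = -k·u are illustrative assumptions, not the equations used in any specific surveyed work.

```python
import torch

# Minimal PINN-style loss for the 1D ODE du/dt = -k*u. `net` maps time t -> u(t);
# inputs are expected as column tensors of shape (N, 1).
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
k = 1.5

def pinn_loss(t_data, u_data, t_collocation):
    # Standard supervised data-fit term on observed points.
    data_loss = torch.mean((net(t_data) - u_data) ** 2)
    # Physics residual term: differentiate the network output w.r.t. its input.
    t = t_collocation.clone().requires_grad_(True)
    u = net(t)
    du_dt = torch.autograd.grad(u, t, grad_outputs=torch.ones_like(u), create_graph=True)[0]
    physics_loss = torch.mean((du_dt + k * u) ** 2)
    return data_loss + physics_loss
```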
3.2. Previous Works
The paper differentiates itself from other surveys by offering a focused and comprehensive review specifically on RL approaches that utilize the structure, properties, or constraints unique to the underlying physics of a process/system.
- Karniadakis et al. (2021): Provided a comprehensive review of Machine Learning (ML) in the context of physics-informed (PI) methods but did not specifically discuss RL-domain approaches. This paper fills that gap.
- Hao et al. (2022): Offered an overview of physics-informed machine learning, briefly touching upon PIRL. This current survey provides a much deeper dive into the RL aspect.
- EEBerer et al. (2022): Showcased the use of prior knowledge to guide RL algorithms, specifically for robotic applications. They categorized knowledge into expert, world, and scientific. In contrast, the current paper's scope of application domains is broader, extending beyond robotics to motion control, molecular structure optimization, safe exploration, and robot manipulation, and it introduces a novel taxonomy specifically for PIRL.
3.3. Technological Evolution
The evolution of Physics-Informed Reinforcement Learning (PIRL) can be traced through the broader trends in AI, ML, and control theory:
- Early RL (Pre-2010s): Focused on tabular methods or linear function approximation, often in discrete, low-dimensional environments. The integration of domain knowledge was typically manual and heuristic-based.
- Rise of Deep RL (2012 onwards): With the advent of Deep Learning (DL) (e.g., Deep Q-Networks (DQN), AlphaGo), RL gained the ability to handle high-dimensional sensory inputs (like images) and complex control tasks. However, these methods were often purely data-driven, leading to issues with sample efficiency, safety, and physical plausibility in real-world scenarios.
- Emergence of PIML (Mid-2010s onwards): Researchers began explicitly embedding physical laws into ML models (e.g., PINNs by Raissi et al., 2019). This addressed limitations of purely data-driven models, particularly in scientific computing and engineering, by ensuring physical consistency and improving generalization with limited data.
- PIRL as a Converging Field (Late 2010s-Present): The natural synergy between PIML and RL led to the birth of PIRL. The realization that RL agents operating in physical environments could greatly benefit from respecting underlying physics spurred research into integrating physics into various components of the RL pipeline:
  - Safe RL: Incorporating control theory concepts like Control Barrier Functions (CBFs) to guarantee safety.
  - Model Learning: Using physics to build more accurate and generalizable world models in MBRL.
  - Reward Shaping: Designing rewards that naturally encourage physically sound behavior.
  - State Representation: Creating physically informed state features to reduce dimensionality and improve interpretability.
  - Network Architectures: Developing neural network designs that inherently encode physical symmetries or dynamics.

This paper's work fits into this timeline by consolidating the diverse approaches developed within PIRL, providing a structured overview, and identifying future directions for this rapidly growing field.
3.4. Differentiation Analysis
Compared to related surveys, this paper's core differences and innovations lie in its specific focus and the depth of its analytical framework for PIRL:
- Exclusive Focus on PIRL: Unlike broader PIML surveys (e.g., Karniadakis et al., 2021; Hao et al., 2022) that cover various ML paradigms (supervised, unsupervised, RL), this paper exclusively focuses on Reinforcement Learning. This allows for a much more granular and tailored analysis of how physics interacts with RL-specific challenges and components.
- Novel Taxonomy based on the RL Pipeline: The most significant innovation is the introduction of a new, unified taxonomy. This framework dissects PIRL methods not just by what physics is used, but how it is integrated (the specific PIRL method) and where in the RL pipeline this integration occurs (problem representation, learning strategy, network design, training, deployment). This provides a holistic, structured view that helps researchers understand the complete integration landscape.
- Comprehensive Categorization of Physics Information: The paper meticulously categorizes the types of physics information (e.g., DAE, BPC, PPV, ODR, PS, PPR). This detailed breakdown is crucial for understanding the diverse ways physical knowledge can be leveraged.
- Detailed PIRL Methodologies and Learning Architectures: It provides an in-depth analysis of specific PIRL methods (e.g., state design, action regulation, reward design) and introduces novel Learning Architecture categories (e.g., safety filter, physics embedded network, differentiable simulator). This architectural perspective offers practical insights into implementation choices.
- Application Domain Breadth: While some prior works focus on specific domains like robotics (e.g., EEBerer et al., 2022), this paper's review spans a much wider range of PIRL applications, from motion control and molecular optimization to safe exploration and energy management, demonstrating the general applicability of PIRL principles.
- Identification of Physics Incorporation Biases: By mapping PIRL approaches to observational, learning, and inductive biases from the broader PIML literature, the paper establishes a clear connection to established PIML paradigms, enriching the theoretical understanding of PIRL.

In essence, this paper provides a much-needed organizational framework and a granular, multi-faceted analysis of PIRL, making it a definitive guide for both newcomers and experienced researchers in this interdisciplinary field.
4. Methodology
This section describes the methodology employed by the survey paper itself to review and classify the Physics-Informed Reinforcement Learning (PIRL) literature. The core methodology revolves around introducing a novel, unified taxonomy and two additional categorizations to analyze existing works comprehensively.
4.1. Principles
The core idea behind the paper's methodology is to systematically organize the burgeoning field of PIRL by introducing a principled framework for understanding, comparing, and contrasting diverse approaches. The theoretical basis is that by deconstructing PIRL methods according to the nature of the physics knowledge, the specific RL components affected, and their placement within the RL pipeline, a clearer picture of the field's landscape, innovations, and gaps can emerge. The intuition is that PIRL is a multidisciplinary field, and a structured categorization can help in identifying common patterns, successful integration strategies, and areas for future research.
4.2. Core Methodology In-depth (Layer by Layer)
The paper's methodology for analyzing PIRL literature is built upon a novel taxonomy and two additional categorization schemes. The overall flow involves identifying relevant literature, extracting details about physics incorporation, and then classifying these details using the proposed framework.
4.2.1. Literature Search and Identification
The survey draws on high-quality sources, including Semantic Scholar, Google Scholar, IEEE Xplore, and SpringerLink. Keywords such as physics-informed, physics-aided, physics informed reinforcement learning, and physics priors were used to search for relevant peer-reviewed journals, conference papers, and technical reports. The volume of research identified over the years is illustrated in Figure 1 of the original paper, showing an exponential growth trend.
The following figure (Figure 1 from the original paper) illustrates the volume of research works identified in this process.
Figure 1 (chart): growth in the number of PIRL publications over the past seven years. The bar chart shows a marked rise from 2018 to 2023, with the 2024 data point marked by an asterisk to indicate the projected continuation of the trend.
4.2.2. Reinforcement Learning (RL) Fundamentals
The paper first establishes the fundamental RL framework, which serves as the backbone for its taxonomy. It defines RL using the Agent-Environment paradigm and the Markov Decision Process (MDP).
The agent-environment RL framework is presented as follows (Figure 2 from the original paper):
Figure 2 (schematic): structure and workflow of the stages of physics-informed reinforcement learning (PIRL), including the RL agent's safety filter, PI reward, embedded system dynamics, and other related components, showing how the RL model and the physics information interact at each stage.
The MDP is formally represented by the tuple $(S, A, R, P, \gamma)$, where:

- $S$: Represents the set of states of the environment.
- $A$: Represents the set of actions that the RL agent can take.
- $R$: The reward function, which typically generates a reward due to the action-induced state transition from $s_t$ to $s_{t+1}$.
- $P$: The environment model that returns the probability of transitioning to state $s_{t+1}$ from $s_t$.
- $\gamma$: The discount factor, which determines the emphasis given to immediate rewards relative to future rewards.

The agent-environment interaction is recorded as a trajectory $\tau$, and the closed-loop trajectory distribution for an episode is given by:
$ p_{\phi}(\tau) = p_{\phi}(s_1, a_1, s_2, a_2, \cdots, s_T, a_T, s_{T+1}) = p(s_1) \prod_{t=1}^{T} \pi_{\phi}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) $
where:

- $\tau$: Represents the sequence of states and control actions $(s_1, a_1, \ldots, s_T, a_T, s_{T+1})$.
- $p(s_1)$: The initial state distribution.
- $\pi_{\phi}(a_t \mid s_t)$: The agent's policy, which is a probability distribution over actions given a state, parameterized by $\phi$.
- $p(s_{t+1} \mid s_t, a_t)$: The environment's dynamics, the probability of the next state given the current state and action.

The objective is to find an optimal policy parameter $\phi^{*}$ that maximizes the expected cumulative discounted reward:
$ \phi^{*} = \arg\max_{\phi} \underbrace{\mathbf{E}_{\tau \sim p_{\phi}(\tau)} \Big[ \sum_{t=1}^{T} \gamma^{t} R(a_t, s_{t+1}) \Big]}_{\mathcal{I}(\phi)} $
where:

- $\mathcal{I}(\phi)$: The objective function representing the expected total discounted reward.
- $\mathbf{E}_{\tau \sim p_{\phi}(\tau)}$: Expectation over trajectories sampled according to policy $\pi_{\phi}$.

The paper also categorizes RL algorithms by model use (Model-free, Model-based) and interaction with the environment (Online, Off-policy, Offline).
The typical RL architectures are presented in Figure 3 from the original paper:
Figure 3 (schematic): the relationship between physics information types (PI types), physics-informed RL methods (PIRL Methods), and the RL pipeline. Arrows indicate where the different methods act within the RL architecture.
4.2.3. Definition and Intuitive Introduction to PIRL
Physics-informed RL (PIRL) is defined as the concept of incorporating physics structures, priors, and real-world physical variables into the policy learning or optimization process. The goal is to improve effectiveness, sample efficiency, and accelerated training for complex problem-solving and real-world deployment. The paper illustrates that physics priors can come in various forms, such as physical rules, mathematical equations, and physics simulators. An example of PIRL (Figure 5 from the original paper) shows how jam-avoiding distance is incorporated as a state space input in adaptive cruise control.
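As a hedged illustration of this kind of physics-variable state design, the sketch below derives a jam-avoiding distance from ego and lead-vehicle speeds and appends it to the raw observation. The spacing formula and parameter values are illustrative assumptions, not the exact quantities used by Jurj et al. (2021).

```python
import numpy as np

def jam_avoiding_distance(v_ego, v_lead, t_headway=1.5, min_gap=2.0, max_decel=3.0):
    """Illustrative physics-based spacing term (constant headway plus a braking margin)."""
    braking_margin = max(0.0, (v_ego**2 - v_lead**2) / (2.0 * max_decel))
    return min_gap + t_headway * v_ego + braking_margin

def augmented_state(gap, v_ego, v_lead):
    """State design: append the physics-derived variable to the raw observation."""
    d_jam = jam_avoiding_distance(v_ego, v_lead)
    return np.array([gap, v_ego, v_lead, d_jam, gap - d_jam])
```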
The map of physics incorporation in the conventional RL framework is illustrated in Figure 4 from the original paper:
Figure 4 (schematic): a state-fusion-based distributed control framework for connected automated vehicles (CAVs). It shows how vehicle and road information (e.g., gap, speed, curvature) is received to design the DRL state for two-dimensional car-following control, together with the communication signals and the dynamics across time steps.
The following figure (Figure 5 from the original paper) provides an illustrative example of physics incorporation:
Figure 5 (schematic): a physics-prior-based action regulation mechanism, depicting the relationship between a dynamical system and an RL controller, and the use of barrier certificates and control barrier functions to enforce constraints.
4.2.4. PIRL Taxonomy
The core of the paper's methodology is the proposed three-category taxonomy, visualized in Figure 9 from the original paper, which serves as a structured lens to analyze PIRL literature.
The PIRL taxonomy is visualized in Figure 9 from the original paper:
Figure 9 (multi-panel charts): the number of works using different RL algorithms in the PIRL literature and the classification of the associated methods, including model learning, data augmentation, and physics variables. The panels show how the different methods are used, their application areas, and the opportunities and challenges for future research.
4.2.4.1. Physics Information (types): Representation of Physics Priors
This category classifies PIRL works based on the form in which physics knowledge is represented.
- Differential and Algebraic Equations (DAE):
  - Description: Many works use system dynamics represented by Partial Differential Equations (PDEs) or Ordinary Differential Equations (ODEs) and boundary conditions (BCs). This physics information is often integrated through Physics-Informed Neural Networks (PINNs) or other specialized networks.
  - Example: In transient voltage control (Gao et al., 2022), a PINN learns a physical constraint from PDEs and transfers it as a loss term to the RL algorithm.
- Barrier Certificate and Physical Constraints (BPC):
  - Description: Used to regulate agent exploration in safety-critical RL applications. This involves concepts from control theory such as the Control Lyapunov Function (CLF), Barrier Function (BF), Control Barrier Function (CBF), or Control Barrier Certificate (CBC). These functions typically define a safe set of states, and a control law is devised to keep the system within this set. They can be represented as Neural Networks (NNs) and learned via data-driven approaches. Physical constraints incorporated into the RL loss also fall here.
  - Example: A Neural Barrier Certificate (NBC) (Zhao et al., 2023) is learned to filter actions and ensure safety.
  - Formula: The condition for learning an NBC and a filtered control action is given as:
    $ (\forall x \in S_0,\; B_{\epsilon}(x) \leq 0) \wedge (\forall x \in S_u,\; B_{\epsilon}(x) > 0) \wedge (\forall x \in \{x \mid B_{\epsilon}(x) = 0\},\; \mathcal{L}_{f(x, u_{RL})} B_{\epsilon}(x) < 0) $
    where:
    - $x$: The state vector.
    - $S_0$: The set of initial states.
    - $S_u$: The set of unsafe states.
    - $B_{\epsilon}(x)$: The Neural Barrier Certificate, parameterized by $\epsilon$, which defines the safe region (if $B_{\epsilon}(x) \leq 0$, the state is safe).
    - $\mathcal{L}_{f(x, u_{RL})} B_{\epsilon}(x)$: The Lie derivative of $B_{\epsilon}$ along the vector field $f$, representing the rate of change of $B_{\epsilon}$ under the system dynamics with control input $u_{RL}$.
    - $u_{RL}$: The action proposed by the RL agent.
    - The condition ensures that initial states are safe, unsafe states are outside the safe region, and when the system is on the boundary of the safe set ($B_{\epsilon}(x) = 0$), the control action drives it back into the safe set (making $B_{\epsilon}$ decrease or stay constant).
- Physics Parameters, Primitives and Physical Variables (PPV):
  - Description: Directly uses physics values extracted or derived from the environment or system. This includes physics parameters, Dynamic Movement Primitives (DMPs), physical states, or physical targets.
  - Example: In energy management (Li et al., 2023), the reward is designed to meet physical objectives like operation cost and self-energy sustainability. Jam-avoiding distance (Jurj et al., 2021) is a physical state input.
- Offline Data and Representation (ODR):
  - Description: Utilizes non-task-specific data collected from real systems (e.g., robots) in an offline setting, often alongside simulators, to improve sim-to-real transfer. It also includes learning physically relevant low-dimensional representations from observations.
  - Example: Hardware data collected from a real robot is used to seed simulators for training control policies (Lowrey et al., 2018). PINNs can extract physically relevant hidden state information (Gokhale et al., 2022).
- Physics Simulator and Model (PS):
  - Description: Simulators are used as test-beds or to impart physical correctness. This includes rigid body physics simulations and data-driven surrogate models that learn environment dynamics or enrich existing partial models.
  - Example: Rigid body physics simulations are used to solve for robot poses that closely follow motion capture clips (Chentanez et al., 2018). Lagrangian Neural Networks (LNNs) (Ramesh & Ravindran, 2023) can learn Lagrangian functions directly from interaction data.
  - Formula: For systems obeying Lagrangian mechanics, the Lagrangian (a scalar quantity) is defined as:
    $ \mathcal{L}(q, \dot{q}, t) = \mathcal{T}(q, \dot{q}) - \mathcal{V}(q) $
    where:
    - $q$: Generalized coordinates.
    - $\dot{q}$: Generalized velocities.
    - $t$: Time.
    - $\mathcal{T}(q, \dot{q})$: Kinetic energy of the system.
    - $\mathcal{V}(q)$: Potential energy of the system.
    The Lagrangian equation of motion (Euler-Lagrange equation) can be expressed as:
    $ \tau = M(q)\ddot{q} + C(q, \dot{q})\dot{q} + G(q), \quad \text{whence} \quad \ddot{q} = M^{-1}(q)\big(\tau - C(q, \dot{q})\dot{q} - G(q)\big) $
    where:
    - $\tau$: External forces/torques acting on the system.
    - $M(q)$: Mass matrix (or inertia matrix).
    - $\ddot{q}$: Generalized accelerations.
    - $C(q, \dot{q})$: Coriolis and centrifugal terms.
    - $G(q)$: Gravitational terms.
    - $M^{-1}(q)$: Inverse of the mass matrix.
    In LNN implementations, separate networks might learn $M^{-1}(q)$ and $\mathcal{L}(q)$, from which the acceleration $\ddot{q}$ is derived.
- Physical Properties (PPR):
  - Description: Fundamental knowledge about the physical structure or properties of a system.
  - Example: System morphology (Xie et al., 2016) or system symmetry (Huang et al., 2023) can be used as priors.
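To make the Euler-Lagrange relation listed under Physics Simulator and Model (PS) concrete, the sketch below evaluates $\ddot{q} = M^{-1}(q)(\tau - C(q,\dot{q})\dot{q} - G(q))$ for a single pendulum. The closed-form $M$, $C$, and $G$ terms are illustrative; an LNN-style surrogate would replace them with networks fitted from interaction data.

```python
import numpy as np

# Forward dynamics of a single pendulum in manipulator-equation form.
m, l, g, damping = 1.0, 1.0, 9.81, 0.05

def forward_dynamics(q, q_dot, tau):
    M = m * l**2                      # mass (inertia) term
    C = damping                       # velocity-dependent term (simple damping here)
    G = m * g * l * np.sin(q)         # gravity term
    return (tau - C * q_dot - G) / M  # q_ddot

def step(q, q_dot, tau, dt=0.01):
    """Semi-implicit Euler integration of the generalized coordinate."""
    q_ddot = forward_dynamics(q, q_dot, tau)
    q_dot = q_dot + dt * q_ddot
    q = q + dt * q_dot
    return q, q_dot
```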
4.2.4.2. PIRL Methods: Physics Prior Augmentations to RL
This category describes which components of the RL paradigm are directly modified or augmented by physics information.
- State design: Modifies or expands the state representation to make it more instructive.
  - Example: State fusion using additional information from the environment (Jurj et al., 2021) or other agents (Shi et al., 2023), extracted features from robust representations (Cao et al., 2023a), or learned surrogate-model-generated data (Han et al., 2022).
- Action regulation: Modifies the action value, often by imposing constraints to ensure safety.
  - Example: CBFs used to restrict actions (Cheng et al., 2019; Li et al., 2021).
- Reward design: Incorporates physics information through effective reward design or augmentation of existing reward functions with bonuses or penalties.
  - Example: Adding an intrinsic component to the reward based on physics (Dang & Ishii, 2022), or a parametric reward function for robotic gaits (Siekmann et al., 2021).
  - Formula: The updated reward function, as given in Siekmann et al. (2021) for periodic robot behavior, is defined as a biased sum of reward components $R_i$:
    $ R(s, \phi) = \beta + \sum_i R_i(s, \phi) $
    where each $R_i$ is a product of a phase coefficient, a phase indicator, and a phase reward measurement:
    $ R_i(s, \phi) = c_i \times I_i(\phi) \times q_i(s) $
    - $s$: Current state.
    - $\phi$: Cycle time variable, typically cycling over [0, 1].
    - $\beta$: A bias term.
    - $R_i$: Individual reward component for a specific phase.
    - $c_i$: Coefficient for the $i$-th phase.
    - $I_i(\phi)$: Indicator function, active (e.g., 1) when the system is in phase $i$.
    - $q_i(s)$: Measurement of the desired physical characteristic during phase $i$.
- Augment policy or value N/W: Incorporates physics principles by adjusting the update rules and losses of the policy or value function networks, or by making direct changes to their underlying network structure.
  - Example: Novel physics-based losses (Mora et al., 2021), constraints for learning (Gao et al., 2022), or embedding dynamical systems as differentiable layers in the policy network (Neural Dynamic Policies (NDP)) (Bahl et al., 2020).
  - Formula: Neural Dynamic Policies (NDP) define a policy as:
    $ \pi(a \mid s; \theta) \triangleq \Omega\big(DE(\Phi(s; \theta))\big), \quad \text{where } DE(w, g) \to \{y, \dot{y}, \ddot{y}\} $
    - $\pi$: The policy, mapping state $s$ to action $a$, parameterized by $\theta$.
    - $\Omega$: An inverse controller that converts system states into actions $a$.
    - $DE(w, g)$: Represents the solution of a second-order differential equation (dynamical system) that outputs the system states $\{y, \dot{y}, \ddot{y}\}$.
    - $\Phi$: A neural network that takes the input state $s$ and predicts the parameters $(w, g)$ of the dynamical system.
    - $w$: Weights of the basis functions of the dynamical system.
    - $g$: A goal for the robot.
    The differential equation is typically of the form $\ddot{y} = \alpha(\beta(g - y) - \dot{y}) + f(x)$, where $\alpha, \beta$ are global parameters for critical damping, and $f$ is a non-linear forcing function.
- Augment simulator or model: Develops improved simulators or learnable models through physics incorporation for more accurate simulation of the real-world environment.
  - Example: Physics-based augmentation of DNN models for accurate system learning (Ramesh & Ravindran, 2023), improved simulators for sim-to-real transfer (Golemo et al., 2018), or physics-informed learning for partially known environment models (Liu & Wang, 2021).
4.2.4.3. RL Pipeline
This category maps PIRL implementations to the different functional stages of a typical RL pipeline.
- Problem representation: Modeling a real-world problem as an MDP, choosing observation vectors, defining reward functions, and specifying action spaces.
- Learning Strategy: Decisions regarding the agent-environment interaction (e.g., model use, learning architecture, RL algorithm choice).
- Network design: Finer details of the learning framework, including constituent units, layer types, and network depth for the policy and value function networks.
- Training: The stage where the policy and allied networks are trained, including Sim-to-real approaches to reduce simulation-reality discrepancies.
- Trained policy deployment: The stage where the fully trained policy is deployed to solve the task.
4.2.5. Further Categorization
The paper introduces two additional categorizations to explain the implementation side of PIRL more precisely.
4.2.5.1. Bias (in integration of physics in RL)
These categories relate to how physics knowledge or priors are integrated into ML models, drawing from PIML literature:
- Observational bias: Uses multi-modal data reflecting physical principles. The DNN is trained directly on observed data to capture the underlying physics.
  - Example: Using offline data to improve sim-to-real transfer (Golemo et al., 2018).
- Learning bias: Reinforces prior physics knowledge through soft penalty constraints (extra terms in the loss function based on physics, like momentum or conservation laws).
  - Example: PINNs embedding PDEs into the loss function (Karniadakis et al., 2021).
- Inductive bias: Incorporates prior knowledge as hard constraints through custom neural network architectures.
  - Example: A Hamiltonian NN respecting conservation laws (Greydanus et al., 2019) or Lagrangian Neural Networks (LNNs) parameterizing arbitrary Lagrangians (Cranmer et al., 2020).
4.2.5.2. Learning Architecture
This categorizes PIRL algorithms based on the alterations introduced to the conventional RL learning architecture. These are illustrated in Figure 8 from the original paper.
The learning architectures for PIRL are illustrated in Figure 8 from the original paper:
该图像是示意图,展示了基于神经动态策略的政策增强框架。图中,给定环境状态 ,神经动态政策生成权重 和目标 ,通过函数 。然后,前向积分器将这些信息用以生成一系列动作 ,以执行环境中的任务;同时,机器人收集下一个状态和奖励,进而训练该策略。
- Safety filter: A PI-based module that regulates the agent's exploration, ensuring safety constraints.
  - Description: Takes the action proposed by the RL agent together with the current state and refines it into a safe action.
  - Example: Control Barrier Certificates (CBCs) (Zhao et al., 2023).
- PI reward: Physics information is used to modify the reward function.
  - Description: The PI-reward module augments the agent's extrinsic reward with a physics-based intrinsic component, giving a combined reward.
  - Example: Reward functions based on probabilistic periodic costs for robotic gaits (Siekmann et al., 2021).
- Residual learning: Consists of a physics-informed controller alongside a data-driven DNN-based policy (the residual RL agent).
  - Description: The physics-informed controller provides a baseline action, and the RL agent learns to produce a residual correction.
  - Example: Combining MFRL and MBRL using CBFs (Cheng et al., 2019).
- Physics embedded network: Physics information (e.g., system dynamics) is directly incorporated into the policy or value function networks.
  - Example: Neural Dynamic Policies (NDP) embedding dynamical systems as differentiable layers (Bahl et al., 2020).
- Differentiable simulator: Uses differentiable physics simulators that explicitly provide loss gradients of simulation outcomes with respect to control actions.
  - Example: Used to compute analytic gradients of the policy's value function for monotonic improvement (Mora et al., 2021).
- Sim-to-real: The agent is trained in a simulator (source domain) and then transferred to a target domain (real world) for deployment, potentially with fine-tuning.
  - Example: Training a recurrent neural network on differences between simulated and actual robotic trajectories to improve the simulator (Golemo et al., 2018).
- Physics variable: Physical parameters, variables, or primitives are introduced to augment components (e.g., states and rewards) of the RL framework.
  - Example: Jam-avoiding distance as an additional state variable (Jurj et al., 2021).
- Hierarchical RL: Includes hierarchical and curriculum-learning-based approaches where physics is typically incorporated into the meta- and sub-policies and value networks.
  - Example: H-NDP learning local dynamical-system-based policies and refining them globally (Bahl et al., 2021).
- Data augmentation: The input state is replaced with a different or augmented form, such as a low-dimensional representation, to derive special and physically relevant features.
  - Example: Extracting physically relevant low-dimensional representations from observations using PINNs (Gokhale et al., 2022).
- PI Model Identification: In a data-driven MBRL setting, physics information is directly incorporated into the model identification process.
  - Example: Learning system models using Lagrangian Neural Networks (LNNs) (Ramesh & Ravindran, 2023).

This comprehensive framework allows the paper to analyze how physics knowledge flows through PIRL methods and integrates into the RL pipeline, enabling a structured review and identification of research gaps.
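As a hedged sketch of the safety-filter architecture above, the wrapper below passes the RL action through only if a one-step prediction keeps a barrier function non-positive; `dynamics`, `barrier`, and `backup_controller` are assumed problem-specific callables, and practical CBF filters usually solve a small quadratic program rather than this simple accept/reject rule.

```python
def safety_filter(state, rl_action, dynamics, barrier, backup_controller, dt=0.05):
    """Refine the RL action: keep it if the predicted next state stays in the safe set
    (barrier value <= 0), otherwise substitute a conservative backup action."""
    next_state = state + dt * dynamics(state, rl_action)   # predicted one-step evolution
    if barrier(next_state) <= 0.0:                          # stays inside the safe set
        return rl_action
    return backup_controller(state)                         # fall back to a safe action
```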
5. Experimental Setup
As a survey paper, this work does not present its own experimental setup, datasets, or evaluation metrics in the traditional sense. Instead, it reviews and analyzes these aspects from the PIRL literature it surveys.
5.1. Datasets
The paper does not use specific datasets for its own research, but it reviews the simulation/evaluation benchmarks commonly used in the PIRL literature. These benchmarks serve as the "datasets" or environments where PIRL algorithms are trained and tested.
The review of simulation/evaluation benchmarks (Section 4.2 of the paper) highlights:
- Standard RL Benchmarks: A majority of works dealing with dynamic control use established RL environments such as OpenAI Gym (e.g., Pusher, Striker, Mountain Car, Cart-Pole, Pendulum), Safe Gym (e.g., Cart pole swing up, Ant), MuJoCo (e.g., Ant, HalfCheetah, Humanoid, Walker2d, Inverted Pendulum), Pybullet (e.g., Franka Panda, Flexiv Rizon), and the DeepMind Control Suite (e.g., Reacher, Pendulum, Cartpole).
- Domain-Specific Benchmarks:
  - Traffic Management: Platforms like SUMO (Simulation of Urban MObility) and CARLA (Car Learning to Act) are utilized for vehicular traffic control applications.
  - Power and Voltage Management: IEEE Distribution System Benchmarks (e.g., IEEE 33-bus, 141-bus, 9-bus systems) are used to evaluate algorithms in electrical grids.
  - Robotics: Specific robotic simulators or platforms like Cassie-MuJoCo-sim, the 6 DoF Kinova Jaco, the Shadow dexterous hand, and Gazebo (for quadrotors) are used.
- Custom Environments: A crucial observation is that a significant number of PIRL works rely on customized or adapted environments. These are often tailored to specific domain problems (e.g., molecular structure optimization, nuclear engineering, air-traffic control, flow field reconstruction, freeform nanophotonic devices) and may be built using tools like COMSOL, DFT (Density Functional Theory) calculations, MATLAB/SIMULINK, or custom physics engines. This reliance on custom environments makes direct comparison between PIRL algorithms challenging.

The following are the results from Table 4 of the original paper:
Reference OpenAI Gym Pusher, Striker, ErgoReacher Golemo et al. (2018) OpenAI Gym Mountain Car, Lunar Lander (continuous) Jiang et al. (2022) OpenAI Gym Cart-Pole, Pendulum (simple and double) Xie et al. (2016) OpenAI Gym Cart-pole Cao et al. (2023c) OpenAI Gym Cart-pole and Quadruped robot Cao et al. (2023b) OpenAI Gym CartPole, Pendulum Liu and Wang (2021) OpenAI Gym Inverted Pendulum (pendulum ν0), Cheng et al. (2019) OpenAI Gym Mountain car (cont.), Pendulum, Cart pole Zhao et al. (2022) OpenAI Gym Simulated car following (He, Jin, & Orosz, 2018) Mukherjee and Liu (2023) MuJoCo Ant, HalfCheetah, Humanoid, Walker2d Humanoid standup, Swimmer, Hopper MuJoCo Inverted and Inverted Double Pendulum (v4) Cassie-MuJoCo-sim (robotics, Year Published/ Last Updated) Duan et al. (2021); Siekmann et al. (2021) 6 DoF Kinova Jaco (Ghosh, Singh, Rajeswaran, Kumar, & Levine, 2017) Bahl et al. (2021, 2020) MuJoCo HalfCheetah, Ant, CrippledHalfCheetah, and SlimHumanoid (Zhou et al., 2018) Lee et al. (2020) MuJoCo Block stacking task (Janner et al., 2018) Veerapaneni et al. (2020) OpenAI Gym CartPole, Pendulum OpenSim-RL (Kidziski et al., 2018) L2M2019 environment Point, car and Doggo goal Korivand et al. (2023) Yang et al. (2023) Safety gym (Yuan et al., 2021) Cart pole swing up, Ant Xu et al. (2022) Humanoid, Humanoid MTU Chen et al. (2023b) Deep control suite (Tassa et al., 2018) Autonomous driving system Pendulum, Cartpole, Walker2d Sanchez-Gonzalez et al. (2018) Acrobot, Swimmer, Cheetah JACO arm (real world) Deep control suite Reacher, Pendulum, Cartpole, Ramesh and Ravindran (2023) Cart-2-pole, Acrobot, Cart-3-pole and Acro-3-bot Rabbit (Chevallereau et al., 2003) Choi et al. (2020) MARL env. (Lowe et al., 2017) ADROIT (Rajeswaran et al., 2017) Multi-agent particle env. Shadow dexterous hand Cai et al. (2021) (Garcia-Hernando et al., 2020) First-Person Hand Action Benchmark (Garcia-Hernando, Yuan, Baek, & Kim, 2018) MuJoCo Door opening, in-hand manipulation, tool use and object relocation SUMO (Lopez et al., 2018), (Han et al., 2022) METANET (Kotsialos, Papageorgiou, Pavlis, & Middelham, 2002) (Wang, 2022) SUMO (Udatha et al., 2022) CARLA (Dosovitskiy, Ros, Codevilla, Lopez, & Koltun, 2017) (Huang et al., 2023) Gazebo (Koenig & Howard, 2004) Quadrotor (I F750A) IEEE Distribution IEEE 33-bus and 141-bus distr. N/W (Chen et al., 2022) system benchmarks IEEE 33-node system (Cao et al., 2023a; Chen et al., 2022) IEEE 9-bus standard system (Gao et al., 2022) Custom (COMSOL based) (Alam et al., 2021) Custom (DFT based) (Cho et al., 2019) Custom (based on (Vrettos, Kara, MacDonald, Andersson, & Call- (Gokhale et al., 2022) away, 2016)) (Jurj et al., 2021) Custom (based on (Kesting, Treiber, Schönhof, Kranke, & Hel- bing, 2007)) Custom (Dang & Ishii, 2022; Li et al., 2023; Martin & Schaub, 2022) Custom (Park et al., 2023; Shi et al., 2023; Yin et al., 2023) Custom (Yousif et al., 2023; Yu et al., 2023; Zhao et al., 2023) Custom (Cohen & Belta, 2023; Li & Belta, 2019; Lutter et al., 2021) Custom (Chen et al., 2023b; Wang et al., 2023) Open AI Gym Custom (Reactor geometries) Radaideh et al. 
(2021) MATLAB-Simulink Custom (She et al., 2023; Zhang et al., 2022) (Mora et al., 2021) MATLAB Custom (Zimmermann, Poranne, Bern, & Coros, 2019) Cruise control (Li et al., 2021) Pygame Custom (Zhao & Liu, 2021) Custom (Unicycle, Car-following) (Emam, Notomista, Glotfelter, Kira, & Egerstedt, 2022) Brushbot, Quadrotor (sim) (Ohnishi et al., 2019) Phantom manipulation platform (Lowrey et al., 2018) Pybullet 2 finger gripper gym-pybullet-drones(Panerati et al., 2021) (Du et al., 2023) Pybullet Franka Panda, Flexiv Rizon (Lv et al., 2022) NimblePhysics(Werling, Omens, Lee, Exarchos, & Liu, 2021),
5.2. Evaluation Metrics
As a survey, the paper does not define or use its own evaluation metrics. However, it implicitly discusses how PIRL approaches aim to improve standard RL metrics or introduce new physics-informed ones, such as:
- Sample Efficiency: PIRL often reduces the number of agent-environment interactions required for learning.
- Physical Plausibility/Soundness: Ensuring solutions adhere to physical laws, which can be measured qualitatively or through specific physics-based error terms.
- Safety Guarantees: Metrics related to avoiding unsafe states or respecting constraints (e.g., number of constraint violations).
- Generalization: Performance on unseen scenarios or different environments, often improved by physics priors.
- Convergence Speed: How quickly the RL algorithm reaches an optimal or near-optimal policy.
- Accuracy: For model-based PIRL, the accuracy of the learned physics model.
5.3. Baselines
The paper does not compare its own method against baselines, as it is a review. Instead, it analyzes the RL algorithms and techniques used in the surveyed PIRL literature. These often implicitly serve as baselines within the individual PIRL papers, with PIRL approaches demonstrating improvements over purely data-driven RL counterparts. The most frequently used RL algorithms in PIRL literature are PPO, DDPG, SAC, and TD3.
6. Results & Analysis
The paper presents its findings through statistical analysis of the surveyed literature and discussions on the impact of PIRL on RL challenges and future trends.
6.1. Core Results Analysis
6.1.1. Statistical Analysis of PIRL Literature
The paper provides several statistical insights into the PIRL landscape, based on Figure 15 of the original paper.
The following figure (Figure 15 from the original paper) presents the statistical insights and trends in PIRL:

a) Use of RL Algorithms:
PPO (Proximal Policy Optimization)and its variants are the most preferredRLalgorithms inPIRLliterature, indicating a preference for on-policy methods known for stability.DDPG (Deep Deterministic Policy Gradient)is the second most popular.- Among newer algorithms,
SAC (Soft Actor-Critic)is favored overTD3 (Twin-Delayed DDPG). This suggests a lean towards algorithms that balance sample efficiency, stability, and applicability in continuous action spaces, which are common in physical systems.
b) Types of Physics Priors Used:
Physics Simulator and Model (PS)andBarrier Certificate and Physical Constraints (BPC)collectively dominate, used in over 60% of works, especially inAction regulationandAugment policy and value N/WPIRLmethods.- This highlights the criticality of
safe explorationand accuratesystem dynamics modelinginPIRL. - Other significant
PItypes includePhysics Parameters, Primitives and Physical Variables (PPV)andDifferential and Algebraic Equations (DAE).
c) Learning Architecture and Bias:
- In
PI rewardandsafety filterarchitectures, physics is predominantly incorporated throughlearning bias. This implies heavy reliance onsoft constraints,regularizers, andspecialized loss functionsto guide the learning process towards physical consistency. Physics embedded networkarchitectures primarily useinductive bias. This signifies the imposition ofhard constraintsthrough the design of custom, physics-aware neural network structures (e.g.,HamiltonianorLagrangian Neural Networks).- This distinction clarifies whether physics is "softly" encouraged via penalties or "hardwired" into the model's fundamental structure.
d) Application Domains:
- Approximately 85% of
PIRLapplications relate tocontroller or policy design. - Miscellaneous control: Accounts for the majority, including optimal control for energy management, data-center cooling, etc.
- Safe control and exploration: Represents 25% of applications, focusing on safety-critical systems and ensuring safe learning.
- Dynamic control: Constitutes 23% of works, involving the control of dynamic systems, including robotic systems.
- Other applications include
optimization/prediction,motion capture/simulation, andimproving general policy optimizationthrough physics incorporation. This breakdown clearly showsPIRL's strong inclination towards practical control problems, especially those requiring safety and dynamic precision.
6.1.2. Impact on Learning Efficiency
Different physics priors have varying impacts on learning efficiency:
- Barrier Certificates and Control Constraints (BPC): Reduce exploration requirements significantly (40-50% sample reduction) by constraining action spaces to safe regions, while ensuring stability (Cheng et al., 2019; Zhao et al., 2023).
- Differential and Algebraic Equations (DAE): Show strong sample efficiency.
PINN-HJB(Hamilton-Jacobi-Bellman) implementations (Zhang et al., 2022) can reduce required samples by up to 70% compared to standardRL. - Physics Simulators (PS): Provide moderate initial efficiency improvements but excel at reducing real-world samples (>85%) during deployment, particularly in
sim-to-realtransfer (Golemo et al., 2018; Lowrey et al., 2018). - Physics Parameters and Variables (PPV): Offer 20-30% faster convergence and substantial robustness improvements (Jurj et al., 2021; Shi et al., 2023).
- Combined Methods: Approaches integrating multiple physics incorporation methods (e.g., ,
Li & Belta (2019)) demonstrate superior efficiency gains across all metrics, achieving up to 65% sample reduction and enhanced generalization. This emphasizes that multi-faceted integration often yields the best results.
6.1.3. Trends in PIRL Research
- Research Trends (2018-2023): Exponential growth (Figure 1), shift towards safety-critical applications using
CBFs(Cheng et al., 2019; Li et al., 2021; Zhao et al., 2023), increased adoption ofPPO, and growing interest in dynamic control problems. A notable trend is the combination of multiple physics incorporation methods rather than single approaches. - Emerging Trends (2024):
PIRLis expanding into:- Autonomous Systems and Robotics: Control optimization for tracking moving objects (Faria et al., 2024), advanced locomotion (Ogum et al., 2024), swimming in turbulent environments (Koh et al., 2025).
- Energy Systems: Robust voltage control (Wei et al., 2023),
physics-guided multi-agent frameworks(Chen et al., 2023a), probabilistic wind power prediction (Chen et al., 2024). - Healthcare Applications: Optimizing arm movements for rehabilitation (Liu et al., 2024),
hand-object interaction controllers(Wannawas et al., 2024). - Transportation Safety: Real-time optimal traffic routing (Ke et al., 2025),
safe multi-agent collision avoidance(Feng et al., 2024). - Advanced Modeling:
Hybrid planning models(Asri et al., 2024),physics-informed deep transfer RL(Zeng et al., 2024). - Safety Guarantees:
Physics-model-guided worst-case sampling(Cao et al., 2024) and domain knowledge integration for enhanced efficiency with safety. - Multi-agent Collaborative Systems: Distributed energy management (Chen et al., 2024).
6.1.4. RL Challenges Addressed by PIRL
PIRL effectively addresses several open problems in RL:
- Sample Efficiency: By augmenting simulators (
sim-to-realgap reduction) or learning more accurateworld modelsinMBRLwith physics,PIRLreduces the need for extensive real-world interaction (Alam et al., 2021; Sanchez-Gonzalez et al., 2018). - Curse of Dimensionality:
PIRLextracts physically relevant low-dimensional representations from high-dimensional state/observation spaces (Cao et al., 2023a; Gokhale et al., 2022). - Safety Exploration: Utilizes control theoretic concepts like
CLFandCBFto guarantee safe exploration and policy deployment, restricting agents to safe state/action sets (Cai et al., 2021; Cheng et al., 2019). - Partial Observability/Imperfect Measurement: Modifies or enhances state representations with additional physics or geographical information to compensate for missing or inadequate data (Jurj et al., 2021; Shi et al., 2023).
- Under-defined Reward Function: Introduces physics information into
reward designorreward shaping, augmenting existing functions with bonuses or penalties that naturally align with physical objectives (Dang & Ishii, 2022; Siekmann et al., 2021).
6.2. Data Presentation (Tables)
The following are the results from Table 2 of the original paper:
| Ref. | Year | Context/ Application | RL Algorithm | Learning arch. | Bias | Physics information | PIRL methods | RL pipeline |
| Chentanez et al. (2018) | 2018 | Motion capture | PPO | Physics reward | Learning | Physics simulator | Reward design | Problem representation |
| Peng, Abbeel, Levine, and Van de Panne (2018) | 2018 | Motion control | PPO (Schulman, Wolski, Dhariwal, Radford, & Klimov, 2017) | Physics reward | Learning | Physics simulator | Reward design | Problem representation |
| Golemo et al. (2018) | 2018 | Policy optimization | PPO | Sim-to-Real | Observational | Offline data | Augment simulator | Training |
| Lowrey et al. (2018) | 2018 | Policy optimization | NPG (Williams, 1992) (C)a | Sim-to-Real | Observational | Offline data | Augment simulator | Training |
| Cho et al. (2019) | 2019 | Molecular structure optimization | DDPG | Physics reward | Learning | DFT (PS) | Reward design | Problem representation |
| Li and Belta (2019) | 2019 | Safe exploration | PPO | Residual RL | Learning | CBF, CLF, FSA/TL (BPC) | Augment simulator Reward design | Training Problem representation |
| Bahl et al. (2020) | Dynamic system | PPO | Phy. embed. | Inductive | DMP (PPV) | Augment policy Augment policy | Learning strategy Network design | |
| Garcia-Hernando | 2020 | control Dexterous | PPO | N/W Residual RL | Observational | Physics | Reward design | Problem |
| et al. (2020) Luo et al. (2020) | 2020 | manipulations 3D Ego pose | PPO | Physics reward | Learning | simulator Physics | State, Reward | representation Problem |
| Bahl et al. (2021) | 2021 | estimation Dynamic system | PPO | Hierarchical RL | Inductive | simulator DMP (PPV) | design Augment policy | representation Network design |
| Margolis et al. | 2021 | control Dynamic system | PPO | Hierarchical RL | Learning | WBIC (PPV) | Augment policy | Learning |
| (2021) Alam et al. (2021) | 2021 | control Manufacturing | SARSA (Sutton & | Sim-to-Real | Observational | Physics engine | Augment | strategy Training |
| Siekmann et al. | 2021 | Dynamic system | Barto, 1998) PPO | Phy. variable | Learning | Physics | simulator Reward design | Problem |
| (2021) Li et al. (2021) | 2021 | control Safe exploration | NFQ (Riedmiller, | Safety filter | Learning | parameters Physical | Action | representation Problem |
| Jurj et al. (2021) | 2021 | and control Safe cruise | 2005) SAC | Phy. variable | Observational | constraint Physical state | regulation State design | representation Problem |
| Mora et al. (2021) | 2021 | control Policy | DPG (C) | Diff. Simulator | Learning | (PPV) Physics | Augment policy | representation Learning |
| Radaideh et al. | 2021 | optimization Optimization, | DQN, PPO | Physics reward | Learning bias | simulator Physical | Reward design | strategy Problem |
| (2021) Zhao and Liu | nuclear engineering Air-traffic control | PPO | Data | properties (PPR) | representation Problem | |||
| (2021) Wang (2022) | 2021 2022 | Motion planner | PPO + AC (Konda & | augmentation Safety filter | Observational Learning | Representation (ODR) CBF (BPC) | State design Action | representation Problem |
| Chen et al. (2022) | 2022 | Active voltage | Tsitsiklis, 1999) TD3 (C) | Safety filter | Learning | Physical | regulation Reward design Penalty function | representation Problem |
| Dang and Ishii | control | constraints | Action regulation | representation Problem | ||||
| (2022) Gao et al. (2022) | structure prediction Transient voltage | representation | ||||||
| Gokhale et al. | 2022 | control Building control | DQN | PINN loss | Learning | PDE (DAE) | Augment policy State design | Learning strategy Problem |
| (2022) Han et al. (2022) | 2022 | Traffic control | Q-learning (C) | Data augment | Observational | Representation (ODR) | representation Problem | |
| Martin and Schaub | 2022 | Q-Learning | Data augment | Observational | Physics model | State design | representation Training | |
| (2022) Jiang, Fu, and Chen | 2022 | Safe exploration and control | SAC | Sim-to-Real | Observational | Physics model | Augment simulator | Problem |
| (2022) | 2022 | Dynamic system control | SAC (etc.) | Physics reward | Learning | Barrier function | Reward design | representation Learning |
| Xu et al. (2022) | 2022 | Policy Learning | Actor-critic (C) | Diff. Simulator | Learning | Physics simulator | Augment policy | strategy |
| Cao et al. (2023c) | 2023 | Safe exploration and control | DDPG | Residual RL | Learning | Physics model | Reward design | Problem representation |
The following are the results from Table 2 (continued) of the original paper:
| Ref. | Year | Context/ Application | RL Algorithm | Learning arch. | Bias | Physics information | PIRL methods | RL pipeline |
| Cao et al. (2023b) | 2023 | Safe exploration and control | DDPG | Residual RL | Inductive | Physics model | Reward design, Action regulation | Problem representation |
| Cao et al. (2023a) | 2023 | | | | Inductive | | N/W editing (Aug. pol.) | Network design |
| Chen, Liu, and Di (2023b) | 2023 | Robust voltage control | SAC | Data augment | Observational | Representation (ODR) | State design | Problem representation |
| | 2023 | Mean field games | DDPG | Physics reward | Learning | Physics model | Reward design | Problem representation |
| Yang et al. (2023) | 2023 | Safe exploration and control | PPO (C) | Safety filter | Learning | NBC (BPC) | Augment policy | Training |
| Zhao et al. (2023) | 2023 | Power system stability enhancement | Custom | Safety filter | Learning | NBC (BPC) | Action regulation | Problem representation |
| Du et al. (2023) | 2023 | Safe exploration and control | AC (C) | Safety filter | Learning | CLBF (Dawson, Qin, Gao, & Fan, 2022; Romdlony & Jayawardhana, 2016) (BPC) | Augment value N/W | Training |
| Shi et al. (2023) | 2023 | Connected automated vehicles | DPPO | Physics variable | Observational | Physical state (PPV) | State design | Problem representation |
| Korivand et al. (2023) | 2023 | Musculoskeletal simulation | SAC (C) | Physics variable | Learning | Physical value | Reward design | Problem representation |
| Li et al. (2023) | 2023 | Energy management | MADRL (C) | Physics variable | Learning | Physical target | Reward design | Problem representation |
| Mukherjee and Liu (2023) | 2023 | Policy optimization | PPO | Phy. embed. N/W | Inductive | PDE (DAE) | Augment value N/W | Network design |
| Yousif et al. (2023) | 2023 | Flow field reconstruction | A3C | Physics reward | Learning | Physical constraints | Reward design | Problem representation |
| Park et al. (2023) | 2023 | Freeform nanophotonic devices | greedy Q | Phy. embed. N/W | Inductive | ABM | Augment value N/W | Network design |
| Rodwell and Tallapragada (2023) | 2023 | Dynamic system control | DPG | Curriculum learning | Learning | Physics model | Augment simulator | Training |
| She et al. (2023) | 2023 | Energy management | TD3 | Sim-to-Real | | Physics model | Augment simulator | Learning strategy |
| Yin et al. (2023) | 2023 | Robot wireless navigation | PPO | Physics reward | Learning | Physical value | Reward design | Problem representation |
The following are the results from Table 3 of the original paper:
| Ref. | Year | Context/ Application | Algorithm | Learning arch. | Bias | Physics information | PIRL method | RL pipeline |
| Xie et al. (2016) | 2016 | Exploration and control | | Model learning | Observational | Sys. morphology (PPR) | Augment model | Learning strategy |
| Sanchez-Gonzalez et al. (2018) | 2018 | Dynamic system control | | Model learning | Inductive | Physics model | Augment model | Learning strategy |
| Ohnishi et al. (2019) | 2019 | Safe navigation | | Safety filter | Learning | CBC (BPC) | Action regulation | Problem representation |
| Cheng et al. (2019) | 2019 | Safe exploration and control | TRPO, DDPG | Residual RL | Learning | CBF (BPC) | Action regulation | Problem representation |
| Veerapaneni et al. (2020) | 2020 | Control (visual RL) | | Model learning | Observational | Entity abstraction (ODR) | Augment model | Learning strategy |
| Lee et al. (2020) | 2020 | Dynamic system control | | Model learning | Observational | Context encoding (ODR) | Augment model | Learning strategy |
| Choi et al. (2020) | 2020 | Safe exploration and control | DDPG (Silver et al., 2014) | Safety filter | Learning | CBF, CLF, QP (BPC) | Augment policy | Learning strategy |
| Liu and Wang (2021) | 2021 | Dynamic system control | Dyna + TD3 (C) | Model identification | Learning | PDE/ODE, BC (DAE) | Augment model | Learning strategy |
| Duan et al. (2021) | 2021 | Dynamic system control | PPO | Residual RL | Learning | Physics model | Action regulation | Problem representation |
| Cai et al. (2021) | 2021 | Multi-agent collision avoidance | MADDPG (C) | Safety filter | Learning | CBF (BPC) | Action regulation | Problem representation |
| Lv et al. (2022) | 2022 | Dynamic system control | TD3 (C) | Sim-to-Real | Learning | Physics simulator | Augment policy | Learning strategy |
| Udatha, Lyu, and Dolan (2022) | 2022 | Traffic control | AC (Ma et al., 2021) | Safety filter | Learning | CBF (BPC) | Augment model | Learning strategy |
| Zhao et al. (2022) | 2022 | Safe exploration and control | DDPG | Safety filter | Learning | CBC (BPC) | Augment policy | Learning strategy |
| Zhang et al. (2022) | 2022 | Distributed MPC | AC (Jiang, Fan, Gao, Chai, & Lewis, 2020) | Safety filter | Learning | CBF (BPC) | State design | Problem representation |
| Ramesh and Ravindran (2023) | 2023 | Dynamic system control | Dreamer (Hafner, Lillicrap, Ba, & Norouzi, 2019) | Phy. embed. N/W | Inductive | Physics model | Augment model | Network design |
| Cohen and Belta (2023) | 2023 | Safe exploration and control | | Safety filter | Learning | CBF (BPC) | Augment model | Learning strategy |
| Huang et al. (2023) | 2023 | Attitude control | | Phy. embed. N/W | Inductive | System symmetry (PPR) | Augment model | Network design |
| Wang, Cao, Zhou, Wen, and Tan (2023) | 2023 | Data center cooling | SAC | Model identification | Learning | Physics laws (PPR) | Augment model | Learning strategy |
| Yu, Zhang, and Song (2023) | 2023 | Cooling system control | DDPG | Residual RL | Learning | CBF (BPC) | Action regulation | Problem representation |
6.3. Ablation Studies / Parameter Analysis
The paper, being a survey, does not conduct its own ablation studies or parameter analysis. Instead, it aggregates insights from the surveyed literature regarding the impact of physics incorporation. The findings on sample efficiency (Section 4.3.2) serve as a high-level form of "ablation analysis" by comparing the efficiency gains achieved when different types of physics priors are integrated into RL systems. For instance, the observation that combining multiple physics incorporation methods yields superior efficiency gains suggests that each component contributes positively, hinting at the results of implicit ablation studies performed within individual papers.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper provides a comprehensive and timely survey of Physics-Informed Reinforcement Learning (PIRL), a rapidly growing field that bridges data-driven Deep Reinforcement Learning (DRL) with fundamental physical principles. The core contribution is a novel, unified taxonomy that systematically categorizes PIRL approaches based on the type of physics information used, the specific PIRL methods employed to augment RL components, and their integration points within the conventional RL pipeline. Additionally, it introduces categorizations by learning architecture and physics incorporation bias to offer deeper insights into implementation strategies.
The review highlights that PIRL significantly enhances RL algorithms by improving physical plausibility, precision, data efficiency, and real-world applicability. It addresses critical RL challenges such as sample efficiency (via augmented simulators and models), curse of dimensionality (through physics-informed feature extraction), safe exploration (using control theoretic tools like CBFs), partial observability (by enriching state representations), and under-defined reward functions (through physics-guided reward design). The statistical analysis reveals dominant RL algorithms (PPO, DDPG), prevalent physics priors (simulators, barrier certificates), and a strong focus on control and safety-critical applications. The paper also identifies clear research trends, including a move towards combining multiple physics incorporation methods and an expansion into diverse emerging domains like autonomous systems, energy, healthcare, and transportation safety.
7.2. Limitations & Future Work
The paper identifies several crucial limitations and open challenges, outlining promising directions for future research in PIRL:
- High-Dimensional Spaces:
  - Limitation: Learning compressed, informative latent spaces from high-dimensional continuous states (or actions) while ensuring physical relevance remains challenging.
  - Future Work: Leverage physics-informed autoencoders (embedding PDE constraints into loss functions) and enforce physics-aware structures during training; utilize latent diffusion models to generate structured representations constrained by physical laws (a minimal sketch of such a physics-regularized autoencoder loss appears after this list).
- Safety in Complex and Uncertain Environments:
  - Limitation: Current physics-informed approaches using control-theoretic concepts (e.g., CBFs) are limited by approximated system models and prior knowledge about safe state sets, and they generalize poorly across different tasks and environments.
  - Future Work: Focus on model-agnostic safe exploration and control, and develop generalized methods for integrating physics into data-driven model learning. This includes formulating safety and performance as a state-constrained optimal control problem using Hamilton-Jacobi-Bellman (HJB) equations and employing conformal prediction-based verification for generalization across complex, high-dimensional environments.
- Choice of Physics Prior:
  - Limitation: The optimal choice of physics prior is critical yet challenging; it requires extensive study, varies significantly across systems, and thus hinders PIRL efficiency. Current approaches often tackle tasks individually.
  - Future Work: Develop a more comprehensive framework for handling novel physical tasks, potentially through Physics-Guided Foundation Models (PGFMs) that integrate broad-domain physical knowledge to enhance robustness, generalization, and prediction reliability across diverse systems.
- Evaluation and Benchmarking Platform:
  - Limitation: The lack of comprehensive benchmarking and evaluation environments makes it difficult to test and compare new PIRL approaches fairly; most works rely on customized, domain-specific environments.
  - Future Work: The field needs standardized, diverse benchmarks that allow fair comparison and assessment of the quality and uniqueness of new PIRL methods.
- Scalability for Large-Scale Problems:
  - Limitation: PIRL approaches face computational bottlenecks when physics-informed components (e.g., barrier certificates, differentiable simulators) are applied to high-dimensional state spaces. Maintaining accurate physics models becomes difficult, and approximation errors can cascade.
  - Future Work: Explore hierarchical decomposition approaches, reduced-order modeling techniques, physics-guided representation learning for dimensionality reduction, and distributed PIRL architectures for multi-agent systems. Recent advances such as Hyper-Low-Rank PINN (Torres et al., 2025), Enhanced Hybrid Adaptive PINN (Luo et al., 2025) with dynamic collocation point allocation, and the Difficulty-Aware Task Sampler (DATS) (Toloubidokhti et al., 2023) are promising directions.
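As referenced in the High-Dimensional Spaces item, the following is a minimal sketch of a physics-informed autoencoder loss: a standard reconstruction term augmented with a finite-difference residual of an assumed governing equation. The oscillator dynamics, constants (`K`, `DT`, `LAMBDA_PHYS`), and names such as `AutoEncoder` and `pirl_ae_loss` are illustrative assumptions rather than a method from the surveyed papers; a PDE-constrained variant would replace the ODE residual with a PDE residual evaluated via automatic differentiation on the decoded fields.

```python
import torch
import torch.nn as nn

# Minimal sketch (not a method from the surveyed papers) of a physics-informed
# autoencoder loss: reconstruction error plus a finite-difference residual of
# an assumed governing equation, here an undamped harmonic oscillator
# dq/dt = p, dp/dt = -K*q. K, DT, and LAMBDA_PHYS are illustrative constants.

K, DT, LAMBDA_PHYS = 1.0, 0.05, 0.1


class AutoEncoder(nn.Module):
    def __init__(self, state_dim: int = 2, latent_dim: int = 1):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 16), nn.Tanh(),
                                     nn.Linear(16, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 16), nn.Tanh(),
                                     nn.Linear(16, state_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


def physics_residual(x_t: torch.Tensor, x_next: torch.Tensor) -> torch.Tensor:
    """Forward-difference residual of the oscillator ODE on reconstructed states."""
    q, p = x_t[:, 0], x_t[:, 1]
    res_q = (x_next[:, 0] - q) / DT - p        # dq/dt - p
    res_p = (x_next[:, 1] - p) / DT + K * q    # dp/dt + K*q
    return (res_q ** 2 + res_p ** 2).mean()


def pirl_ae_loss(model: AutoEncoder, x_t: torch.Tensor, x_next: torch.Tensor) -> torch.Tensor:
    rec_t, rec_next = model(x_t), model(x_next)
    reconstruction = ((rec_t - x_t) ** 2).mean() + ((rec_next - x_next) ** 2).mean()
    return reconstruction + LAMBDA_PHYS * physics_residual(rec_t, rec_next)


if __name__ == "__main__":
    # Synthetic consecutive snapshots of the oscillator (exact solution for K = 1).
    t = torch.arange(0.0, 10.0, DT)
    traj = torch.stack([torch.cos(t), -torch.sin(t)], dim=1)
    x_t, x_next = traj[:-1], traj[1:]

    model = AutoEncoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(200):
        opt.zero_grad()
        loss = pirl_ae_loss(model, x_t, x_next)
        loss.backward()
        opt.step()
    print("final loss:", float(loss))
```

The physics term penalizes latent representations whose reconstructions violate the known dynamics, which is one concrete way to enforce the "physical relevance" of a compressed state space discussed above.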
7.3. Personal Insights & Critique
This survey paper by Banerjee et al. is an excellent and timely contribution to the Physics-Informed Reinforcement Learning (PIRL) field. Its greatest strength lies in the meticulously crafted taxonomy, which provides a much-needed structural lens to understand a rapidly diversifying research area. Before this paper, the landscape of PIRL could appear fragmented, with various approaches integrating physics in different ways without a unifying framework. The proposed categorization by physics information type, PIRL method, and RL pipeline stage is incredibly insightful and will undoubtedly serve as a foundational reference for future research. The additional bias and learning architecture categorizations further enrich this framework, offering practical guidance for design choices.
One key inspiration drawn from this paper is its emphasis on multi-faceted physics integration. The finding that approaches combining multiple physics incorporation methods (e.g., physics-guided rewards, action regulation, and network editing) often yield superior results suggests that a holistic, layered approach to infusing domain knowledge is more powerful than relying on a single mechanism. This insight transfers to other ML domains where domain expertise is available but often underutilized, encouraging a more comprehensive integration strategy. For instance, in scientific machine learning applications beyond RL, one could combine physics-informed loss functions with physics-aware network architectures and data augmentation techniques that respect physical symmetries.
A potential area for improvement or an unverified assumption highlighted implicitly by the paper is the generalizability of physics priors. While PIRL aims for better generalization, the reliance on customized environments and the challenge of choosing the "right" physics prior for novel tasks suggest that current PIRL methods might still suffer from a form of physics overfitting. That is, the chosen physics might be highly effective for a specific system or task but not easily transferable. The future direction of Physics-Guided Foundation Models is a promising attempt to address this, but the inherent complexity of diverse physical systems means that a truly "universal" physics prior might remain elusive. Further research could explore adaptive physics prior selection or meta-learning approaches that learn which type of physics information is most relevant for a given problem class.
Another point of critique, though acknowledged by the authors, is the lack of standardized benchmarks. While the paper reviews existing benchmarks, the prevalence of "custom environments" in Tables 2, 3, and 4 underscores a significant hurdle for comparing and validating PIRL methods. This makes it difficult to ascertain whether observed performance gains are due to novel PIRL techniques or simply idiosyncratic properties of a specific custom setup. The PIRL community urgently needs a suite of diverse, standardized, and open-source benchmarks with varying levels of physical complexity and uncertainty to facilitate rigorous comparisons and accelerate progress.
Overall, this paper is a landmark survey that not only consolidates the current state of PIRL but also thoughtfully charts a course for its future, making it an invaluable resource for researchers in AI, robotics, and computational science.