Model-Based Safe Reinforcement Learning with Time-Varying State and Control Constraints: An Application to Intelligent Vehicles

Recently, safe reinforcement learning (RL) with the actor-critic structure for continuous control tasks has received increasing attention. Yet it remains challenging to learn a near-optimal control policy with safety and convergence guarantees, and few works have addressed safe RL algorithm design under time-varying safety constraints. This paper proposes a safe RL algorithm for optimal control of nonlinear systems with time-varying state and control constraints. In the proposed approach, we construct a novel barrier force-based control policy structure to guarantee control safety. A multi-step policy evaluation mechanism is proposed to predict the policy's safety risk under time-varying safety constraints and to guide the policy to update safely. Theoretical results on stability and robustness are proven, and the convergence of the actor-critic implementation is analyzed. The proposed algorithm outperforms several state-of-the-art RL algorithms in the simulated Safety Gym environment. Furthermore, the approach is applied to the integrated path-following and collision-avoidance problem on two real-world intelligent vehicles: a differential-drive vehicle and an Ackermann-drive one are used to verify offline deployment and online learning performance, respectively. Our approach shows an impressive sim-to-real transfer capability and satisfactory online control performance in the experiments.

Autonomous driving has been viewed as a promising technology that will bring fundamental changes to everyday life. Still, one of the crucial issues concerns how to learn to drive safely in dynamic and unknown environments with unexpected obstacles [12]. For these practical reasons, many safe RL algorithms have been developed recently for safety-critical systems; see, e.g., [10], [11], [13]-[25] and the references therein. Note that there are also fruitful works on adaptive control with constraints, but the techniques used there differ from those in RL; for related references in adaptive control, one may refer to [26], [27].
In general, current safe RL solutions can be categorized into three main approaches. (i) The first family utilizes a dedicated mechanism in the learning procedure for safe policy optimization using, e.g., control barrier functions [17], [21], [28], formal verification [29], [30], shielding [31]-[33], and external intervention [34], [35]. These methods are prone to safety-biased learning, sacrificing considerable performance, and some of them rely on extra human interference [34], [35]. (ii) The second family proposes safe RL algorithms via primal-dual methods [16], [36]-[38]. In the resulting optimization problem, the Lagrangian multiplier serves as an extra weight whose update is known to be sensitive to the control performance [16]. Moreover, some optimization problems, such as those with ellipsoidal constraints (covered in this work), might not satisfy the strong duality condition [39]. (iii) The third is reward/cost-shaping-based RL approaches [40]-[43], where the cost functions are augmented with various safety-related parts, e.g., barrier functions. As stated in [37], such a design only states the goal of guaranteeing safety by minimizing the reshaped cost function but fails to guide how to achieve it through an actor-critic structure design. Specifically, the reshaped cost function values, which are usually evaluated at discrete-time instants, might change rapidly when the system variables approach the constraint boundary. Consequently, the weights of the actor and critic networks are prone to divergence in the training process. These issues motivated our barrier force-based (simplified as barrier-based) actor-critic structure. In this work, we incorporate this unique structure into a control-theory-based RL framework, where a model-based multi-step policy evaluation mechanism is also utilized to ensure convergence and safety in online learning scenarios. Moreover, few works have addressed safe RL algorithm design under time-varying safety constraints.
This work proposes a model-based safe RL algorithm with theoretical guarantees for optimal control under time-varying state and control constraints. A new barrier force-based control policy (BCP) structure is constructed in the proposed safe RL approach, generating repulsive control forces as the states and controls move toward the constraint boundaries. Moreover, the time-varying constraints are addressed by a multi-step policy evaluation (MPE). The proposed safe RL approach is implemented by an online barrier force-based actor-critic learning algorithm. The closed-loop theoretical properties of our approach under nominal and perturbed cases and the convergence condition of the barrier-based actor-critic (BAC) learning algorithm are derived. The effectiveness of our approach is tested in both simulations and on real-world intelligent vehicles. Our contributions are summarized as follows.
(i) We proposed a safe RL algorithm for optimal control under time-varying state and control constraints. Under certain conditions (see Sections IV-A and IV-B), safety can be guaranteed in both online and offline learning scenarios. The performance and advantages of the proposed approach are achieved by two novel designs. The first is a barrier force-based control policy that ensures safety within an actor-critic structure. The second is a multi-step evaluation mechanism that predicts the control policy's future influence on the value function under time-varying safety constraints and guides the policy to update safely.
(ii) We proved that the proposed safe RL could guarantee stability and robustness in the nominal scenario and under external disturbances, respectively. Also, the convergence condition of the actor-critic implementation was derived by the Lyapunov method.
(iii) The proposed approach was applied to solve an integrated path-following and collision-avoidance problem for intelligent vehicles, so that the control performance can be optimized with theoretical guarantees even under external disturbances. a) Extensive simulation results illustrate that our approach outperforms other state-of-the-art safe RL methods in learning safety and performance. b) We verified our approach's offline sim-to-real transfer capability and real-world online learning performance, as well as its advantages over state-of-the-art model predictive control (MPC) algorithms.
The remainder of the paper is organized as follows. Section II introduces the considered control problem and preliminary solutions. Section III presents the proposed safe RL approach and the BAC learning algorithm, while Section IV presents the main theoretical results. Section V shows the real-world experimental results, while some conclusions are drawn in Section VI. Some proofs of the theoretical results and additional experimental results are given in the appendix.
Notation: We denote by N the set of natural numbers and by N_a^b the set of integers a, a+1, ..., b. For a vector x ∈ R^n, we denote ∥x∥_Q^2 = x^⊤Qx and by ∥x∥ the Euclidean norm. For a function f(x) with argument x, we denote by ∇f(x) the gradient with respect to x. For a function f(x, u) with arguments x and u, we denote by ∇_z f(x, u) the partial gradient with respect to z, z = x or u. Given a matrix A ∈ R^{n×n}, we use λ_min(A) (λ_max(A)) to denote the minimal (maximal) eigenvalue. We denote by Int(Z) the interior of a general set Z. For variables

II. PROBLEM FORMULATION
In this section, we first describe the considered model and the associated safety constraints. Then, the optimal control objective and preliminary RL results are given. Finally, the safe RL problem formulation by cost reconstruction with barrier functions is presented.

A. System Model and Constraints
The considered system under control is a class of discrete-time nonlinear systems described by

x_{k+1} = f(x_k, u_k),   (1)

where x_k ∈ X_k ⊆ R^n and u_k ∈ U_k ⊆ R^m are the state and input variables, k is the discrete-time index, the functions defining the constraint sets X_k and U_k are assumed to be C^2, and f is a smooth state transition function with f(0, 0) = 0.
In principle, different types of state constraints can be formalized as follows. For instance, a circular dynamic obstacle leads to the time-varying constraint X_k = {x ∈ R^n | ∥x − c_k∥ ≥ d}, where c_k and d are the center and radius of the circular dynamic obstacle, respectively.
Definition 1 (Local stabilizability [39]): System (1) is stabilizable on X_k × U_k if, for any x_0 ∈ X_k, there exists a C^1 state-feedback policy u(x_k) ∈ U_k with u(0) = 0, such that x_k ∈ X_k and x_k → 0 as k → +∞.
Assumption 1 (Lipschitz continuity): Model (1) is Lipschitz continuous on X_k × U_k for all k ∈ N_1^∞, i.e., there exists a Lipschitz constant 0 < L_f < +∞ such that, for all x_1, x_2 ∈ X_k and all C^1 control policies u with u(x_i) ∈ U_k,

∥f(x_1, u(x_1)) − f(x_2, u(x_2))∥ ≤ L_f ∥x_1 − x_2∥.   (2)

B. Control Objective
Starting from any initial condition x_0 ∈ X_0, the control objective is to find an optimal control policy u_k^* = u^*(x_k) ∈ U_k that minimizes a quadratic regulation cost function of the type

J(x_0) = Σ_{k=0}^{∞} γ^k r(x_k, u_k),   (3)

subject to model (1), x_k ∈ X_k, and u_k ∈ U_k, ∀k ∈ N, where

r(x_k, u_k) = ∥x_k∥_Q^2 + ∥u_k∥_R^2,   (4)

Q = Q^⊤ ∈ R^{n×n}, R = R^⊤ ∈ R^{m×m}, Q, R ≻ 0, and γ is a discounting factor. Without loss of generality, many waypoint tracking problems in the robot control field can be naturally formed as the prescribed regulation one, with a proper coordinate transformation of the reference waypoints. More generally, the time-varying state constraint is allowed not to contain the origin for some k ∈ N. Typical examples can be found in, for instance, path following of mobile robots with collision avoidance, where the potential obstacle to be avoided might occupy the reference waypoints, i.e., the origin after the coordinate transformation. In this scenario, it is still reasonable to introduce the following assumption to guarantee convergence.
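For concreteness, the discounted quadratic objective above can be evaluated numerically over a finite rollout. The sketch below is illustrative only; the function names are ours, not the paper's:

```python
import numpy as np

def stage_cost(x, u, Q, R):
    """Quadratic stage cost r(x, u) = x^T Q x + u^T R u, as in (4)."""
    return float(x @ Q @ x + u @ R @ u)

def discounted_cost(xs, us, Q, R, gamma):
    """Truncated discounted sum J = sum_k gamma^k r(x_k, u_k), as in (3)."""
    return sum(gamma**k * stage_cost(x, u, Q, R)
               for k, (x, u) in enumerate(zip(xs, us)))
```

In practice the infinite horizon in (3) is approximated by a sufficiently long rollout, since γ < 1 makes the tail contribution negligible.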
Assumption 3 (State constraint): There exists a finite number k̄ ∈ N such that {0} ⊆ X_k for all k ≥ k̄.
Definition 2 (Multi-step safe control): For a given state x_k ∈ X_k at time instant k, a control policy u(x_k) ∈ U_k is L-step safe for (1) if the resulting future state evolutions of (1) under u(x_k) satisfy x_{k+i} ∈ X_{k+i}^u, ∀i ∈ N_1^L, where X_{k+i}^u is the resulting state constraint under u(x_k).
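Definition 2 suggests a direct computational check: roll the nominal model forward L steps under the candidate policy and test membership in each time-varying safe set. A minimal sketch, where the membership test `in_X` and its signature are our assumptions:

```python
import numpy as np

def is_L_step_safe(x0, policy, f, in_X, L):
    """Check Definition 2: simulate model f for L steps under `policy`
    and verify every predicted state lies in the (time-varying) safe set."""
    x = np.asarray(x0, dtype=float)
    for i in range(1, L + 1):
        x = f(x, policy(x))
        if not in_X(x, i):   # X_{k+i} may depend on the step index i
            return False
    return True
```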
To simplify the notation, in the rest of the paper the superscript in X_k^u is omitted, i.e., we use X_k to denote X_k^u.

C. Preliminary Reinforcement Learning Solutions
In the special case where only the control constraint is considered, i.e., X_k = R^n, the optimal value function can be defined by

J^*(x_k) = min_u Σ_{i=k}^{∞} γ^{i−k} r(x_i, u_i),

which satisfies the Hamilton-Jacobi-Bellman (HJB) equation

J^*(x_k) = min_{u_k ∈ U_k} { r(x_k, u_k) + γJ^*(x_{k+1}) },   (5)

and the optimal control policy is

u^*(x_k) = arg min_{u_k ∈ U_k} { r(x_k, u_k) + γJ^*(x_{k+1}) }.   (6)

Various RL solutions have been contributed to solving optimization problem (5) with (6), resorting to an actor-critic approximation structure (cf. [4], [5], [44]). In control problems with state constraints, solving (5) with (6) under a safety guarantee is more challenging within the trial-and-error-based actor-critic reinforcement learning framework. Recently, various safe RL solutions have emerged [10], [11], [13]-[25]. However, few works have addressed safe RL algorithm design under time-varying safety constraints. One typical class of safe RL algorithms shapes the cost (3) with barrier functions (see [40]-[43]), which can be insufficient in many cases to guide how to ensure safety via actor-critic learning. Take the considered discrete-time optimal control problem as an example. Note that the gradients of barrier functions change rapidly when the state or control approaches the constraint boundaries. On this ground, the reshaped cost function with barrier functions might experience abrupt changes, since it is evaluated at discrete-time instants (in most cases the sampling interval is not chosen small enough). As a result, the weights of the actor-critic networks are prone to divergence in the training process due to the abrupt cost value changes. For these reasons, we propose a safe RL approach using a BCP structure and an MPE mechanism for optimal control under time-varying safety constraints.
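The HJB recursion above can be illustrated with a tabular Bellman backup on a one-dimensional state grid. This is only a didactic sketch; the paper instead uses actor-critic function approximation:

```python
import numpy as np

def bellman_backup(J, xs, us, f, r, gamma):
    """One HJB/Bellman backup on a 1-D state grid:
    J(x) <- min_u [ r(x, u) + gamma * J(f(x, u)) ],
    with next states mapped to the nearest grid point."""
    J_new = np.empty_like(J)
    for i, x in enumerate(xs):
        best = np.inf
        for u in us:
            xn = np.clip(f(x, u), xs[0], xs[-1])
            j = np.abs(xs - xn).argmin()   # nearest-neighbor state lookup
            best = min(best, r(x, u) + gamma * J[j])
        J_new[i] = best
    return J_new
```

Iterating this backup to a fixed point recovers an approximation of J^* on the grid; function approximation replaces the grid when the state dimension grows.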

D. Definition on Barrier Functions
As policy improvement is usually performed by the gradient descent method in actor-critic RL, we have to reconstruct the cost function in (3) by incorporating continuous barrier functions of state and control constraints. To this end, we first introduce a definition of barrier functions as follows.
Definition 3 (Barrier function [45]): For a general convex set Z_k = {z ∈ R^l | G_{z,k}^i(z) ≤ 0, ∀i ∈ N_1^{p_z}}, a barrier function is defined as

B_k(z) = −Σ_{i=1}^{p_z} log(−G_{z,k}^i(z)).   (7)

To derive a satisfactory control performance, we define a recentered transformation of B_k(z) around a point z^c ∈ Int(Z_k):

B_k^c(z) = B_k(z) − B_k(z^c) − ∇B_k(z^c)^⊤(z − z^c).

This definition leads to the property that B_k^c(z) ≥ 0 and that the minimum is reached at z^c, i.e., B_k^c(z^c) = 0. For the case {0} ⊈ Z_k, we suggest selecting z^c far from the set boundary of Z_k, at the central point of Z_k or in its neighborhood (if possible).

Lemma 1 (Relaxed barrier function [45]): Define a relaxed barrier function of B_k^c(z) as

B̄_k(z) = B_k^c(z) if σ̄_k ≥ κ_b, and B̄_k(z) = γ_b(z, σ̄_k) otherwise,

where the relaxing factor κ_b > 0 is a small positive number, σ̄_k = min_{i∈N_1^{p_z}} −G_k^i(z), and the function γ_b(z, σ̄_k) is strictly monotone and differentiable on (−∞, κ_b).

Proof: For details please see [45]. □
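A common concrete choice is the logarithmic barrier with a quadratic relaxation below the margin κ_b. The specific quadratic extension below is our assumption; the paper only requires γ_b to be strictly monotone and differentiable:

```python
import math

def relaxed_log_barrier(sigma, kappa_b=0.05):
    """Relaxed scalar log barrier in the spirit of Lemma 1: exact -log(sigma)
    while the constraint margin sigma = -G(z) exceeds kappa_b; below it,
    a quadratic extension keeps the function finite and C^1 everywhere
    (one common choice of gamma_b, assumed here)."""
    if sigma > kappa_b:
        return -math.log(sigma)
    t = (sigma - kappa_b) / kappa_b
    # matches value (-log kappa_b) and slope (-1/kappa_b) at sigma = kappa_b
    return -math.log(kappa_b) - t + 0.5 * t * t
```

The relaxation is what keeps the cost finite when exploration briefly leaves the safe set, avoiding unbounded gradients during training.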

III. SAFE RL WITH BCP AND MPE
This section presents our safe RL approach and its implementation by an efficient barrier-based actor-critic learning algorithm. Our safe RL approach has two novel ingredients. The first is a barrier force-based control policy structure, which has a physical force interpretation that serves to ensure safety. The second is a multi-step policy evaluation mechanism, which provides a multi-step safety risk prediction to guide the policy to update safely online under time-varying constraints.

A. Design of Safe RL with BCP and MPE
We reconstruct the performance index J(x_k) with the state and control barrier functions defined in (7). Letting µ > 0 be a tuning parameter, the resulting value function, denoted as J̄(x_k), is defined as

J̄(x_k) = Σ_{i=k}^{∞} γ^{i−k} r̄(x_i, u_i),   (10)

where r̄(x_k, u_k) = r(x_k, u_k) + µB_k(u_k) + µB_k(x_k). Note that, in addition to the logarithmic barrier function (7), other general types of differentiable barrier functions, such as exponential and polynomial ones, could be used instead to construct J̄(x_k); this is, however, beyond the scope of this work.

Proposition 1 (Unconstrained control problem equivalence): Letting s_k(x_k) = B_k(x_k), the control problem for (1) with cost (10) is equivalent to an unconstrained optimal control problem for the augmented system (11) with state y_k = (x_k, s_k(x_k)), with the cost in (10) rewritten as J̄_u in (12), where Q_y = diag{Q, µ}.
Proof: First, note that J̄_u in (12) is equivalent to J̄ in (10), since y_k = (x_k, s_k(x_k)) and s_k(x_k) = B_k(x_k). Hence, the control problem for (11) with cost (12) is equivalent to that for (11) with cost (10), and consequently to that for (1) with cost (10), by disregarding the definition y_k = (x_k, s_k(x_k)) in (11). □

In (12), the overall objective function consists of the classical quadratic-type regulation costs and the barrier functions on the state and control. The tuning parameter µ determines the influence of the barrier function values on the overall objective function. For a given µ, the barrier functions can even become dominant when the control and state are close to the boundaries of the safety constraints. Indeed, the parameter µ (relative to Q and R) represents a trade-off between optimality and learning safety.
To solve the control problem with J̄(x_k), we propose a novel barrier force-based control policy (13), inspired by the barrier method in interior-point optimization [46], where v_k ∈ R^m is a new virtual control input, and ρ ∈ R and K ∈ R^{m×n} are decision variables to be further optimized (see also Section IV).

Remark 1: In (13), the roles of the second and third terms are to generate repulsive forces as the variables x and v move toward the corresponding boundaries of the constraints. As a result, (13) generates joint forces that exactly balance the forces associated with J(x_k) and the barrier functions in J̄(x_k). Hence, our control policy has a physical force interpretation that serves to ensure safety.
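The repulsive-force behavior described in Remark 1 can be seen from the gradient of a log barrier: it vanishes at the center of the safe set and grows unboundedly near the boundary. A toy one-dimensional illustration with a hypothetical box constraint |x| ≤ 1 (not the paper's exact policy (13)):

```python
def barrier_force(x, bound=1.0):
    """Negative gradient of B(x) = -log(bound - x) - log(bound + x):
    the 'repulsive force' pointing away from the nearest boundary."""
    return -(1.0 / (bound - x) - 1.0 / (bound + x))
```

Any policy term proportional to this quantity pushes the state back toward the interior, with a magnitude that diverges as the boundary is approached.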
At any time instant k, let the L-step-ahead control policy be u(x_τ), ∀τ ∈ N_0^{L−1}, where L ∈ N. Hence, one can write the difference equation (14) for the multi-step prediction of the stage cost under u(x). Under control (13), letting J̄^*(x_k) be the optimal value function at time instant k, a variant of the discrete-time HJB equation can be written as (15a), and the optimal solution is given by (15b). We propose a safe RL algorithm with the barrier-based control policy (BCP) and multi-step policy evaluation (MPE) in Algorithm 1 to solve for u^*(x_k) and J̄^*(x_k).

Algorithm 1 Safe RL with BCP and MPE
3) Barrier-based control policy update:

Remark 3: Note that model-free RL algorithms have received considerable attention in continuous control tasks [3]. However, model-free approaches still suffer from data inefficiency and are mainly suitable for specific tasks with valid datasets [17], [48]. In our case, we focus on a model-based framework with safety and convergence guarantees, because it is more suitable for the real-world safety-critical vehicle control tasks of concern. The extension of our approach to the model-free case is left for future investigation.

B. Barrier-based Actor-Critic Implementation
In the following, Algorithm 1 is implemented with a barrier-based actor-critic learning structure. We first construct a critic network consistent with J̄ in (10), including the barrier functions:

Ĵ(x_k) = W_{c1}^⊤ σ_c(x_k) + W_{c2} B_k(x_k),

where W_{c1} ∈ R^{N_c} and W_{c2} ∈ R are weighting matrices, and σ_c ∈ R^{N_c} is a vector composed of basis functions. In a collective form, we write Ĵ(x_k) = W_c^⊤ h_c(x_k). The ultimate goal of the critic network is to minimize the distance between J̄^* and Ĵ by updating W_c. However, as J̄^* is not available, the following J̄_d(x_k) (defined according to (15a)) is used as the target to be tracked by Ĵ. Let ε_{c,k} = J̄_d(x_k) − Ĵ(x_k) be the approximation residual, δ_{c,k} = ε_{c,k}^2, and γ_c the learning rate; then the update rule of the weight W_c according to gradient descent is given as (18). We next design the actor network for learning the control policy (13) in the form (19), where W_{a,σ} ∈ R^{N_u×m}, K̂ ∈ R^{m×n}, and ρ̂ ∈ R are the weighting matrices, and σ_a ∈ R^{N_u} is a vector of basis functions.
In view of (15b) and (19), letting ν_k^d be the target virtual control given by (15b), denote ε_{a,k} = ν_k^d − ν_k as the approximation residual, δ_{a,k} = ∥ε_{a,k}∥^2, and let γ_a be the learning rate; then the update rule of W_{a1} and ρ̂ according to gradient descent is given as (20). For a visual display of the barrier-based actor-critic learning algorithm, please see Fig.

Remark 4: Although we introduce barrier-based terms in the actor-critic structure for policy learning, the resulting algorithm implementation procedure is comparable to (only slightly more complex than) a standard actor-critic structure, since all the weights can be learned with standard gradient descent rules (see (18) and (20)) and only a few additional weighting matrices W_{c,2}, K̂, and ρ̂ are updated alongside.
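The critic step in (18) is an ordinary squared-residual gradient descent. A minimal sketch with a linear-in-features critic; the function name and feature map are our assumptions:

```python
import numpy as np

def critic_update(W_c, h, J_d, gamma_c=0.05):
    """One gradient-descent step on delta_c = (J_d - W_c^T h)^2, i.e.,
    W_c <- W_c - gamma_c * d(delta_c)/dW_c = W_c + 2*gamma_c*eps_c*h,
    where h is the critic feature vector evaluated at x_k."""
    eps_c = J_d - W_c @ h   # approximation residual
    return W_c + 2.0 * gamma_c * eps_c * h
```

The actor update (20) has the same shape, with the residual taken on the virtual control instead of the value target.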

IV. THEORETICAL RESULTS
The theoretical properties of the proposed safe RL approach in the nominal and disturbed scenarios are proven in Sections IV-A and IV-B, respectively. Then, the convergence analysis of the barrier-based actor-critic learning algorithm under time-invariant constraints is given in Section IV-C.

A. Safety and Stability Guarantees in Nominal Scenario
In the following, we prove the convergence of the proposed safe RL with BCP and MPE, i.e., Algorithm 1, under time-varying constraints.
Assumption 4 (Stabilizability): For any x k ∈ X k , there exist v(x k ), ρ, K constituting a control policy u(x k ) ∈ U k such that system (11) is locally stabilizable under (13).
Specifically, by setting ρ = 0 and K = 0, Assumption 4 is equivalent to the standard local stabilizability condition as in [39] when the state and control constraints are time-invariant. From Assumption 4, one can promptly derive the following L-step safe control condition: given x_k ∈ X_k, there exists an L-step safe control policy such that x_{k+l} ∈ X_{k+l}, ∀l ∈ N_1^L. Note that this condition is equivalent to the existence of a 1-step safe control policy, since the former can be derived from the latter by mathematical induction. For the above safe control condition to hold, the variation of the state constraints cannot be arbitrarily large. Let X̄_{k+1} = {x_{k+1} | x_{k+1} = f(x_k, u_k), x_k ∈ X_k, u_k ∈ U_k} be the maximal reachable set from X_k under U_k. We require that the state constraint at any time k+1 satisfy X_{k+1} ⊆ X̄_{k+1}.
Theorem 1 (Convergence): If u_0(x_k) ∈ U_k is such that the relaxed barrier function B_{k+l}(x_{k+l}) is finite¹ ∀l ∈ N_1^L, and the value function satisfies J̄_0(x_k) ≥ r̄(x_k, u_0(x_k)) + γJ̄_0(x_{k+1}), then with (15) it holds that. Proof: Please refer to Appendix A. □

Remark 5: Let J_p(x_k) be the safe and optimal value of (14) with µ = 0 under (13) and the control and state constraints. Then J̄^*(x_k) is a good approximation of J_p(x_k) for small µ. Moreover, if the optimization problem with (3) under (13) is dual feasible, then one can obtain a quantifiable bound on the gap between J̄^*(x_k) and J_p(x_k), which decreases along with µ (see Page 566 in [46]). This implies that the control policy u^*(x_k), associated with J̄^*(x_k), is L-step safe provided µ is chosen suitably small.

Remark 6: Theorem 1 also implies that, at any time instant k, the initial control policy need not be L-step safe to guarantee safety and convergence. Hence, recursive feasibility under time-varying constraints is likely guaranteed as long as an L-step safe control policy exists for any x_k ∈ X_k, which can be certified by Assumption 4.
Proposition 2 (Stability): Let γ = 1, x_0 ∈ X_0, and let u^* be the control policy with the optimal solution v^*, ρ^*, and K^*, obtained by minimizing (12) subject to (11). Under Assumptions 3 and 4, the state x_k of model (1) under u^* converges to the origin as k → +∞. Proof: Please refer to Appendix A. □

Remark 7: Note that the discounting factor 0 < γ ≤ 1 is a crucial ingredient in reinforcement learning to ensure the convergence of the value function and control policy to the optimal ones (see [49]). However, as illustrated by Proposition 2, a choice of γ close to 1 is suggested to guarantee closed-loop stability.

B. Safety and Robustness Guarantees in Disturbed Scenario
We show that our approach can guarantee safety and robustness under disturbances by properly shrinking the state constraints in the learning process. To this end, let the real model dynamics be given as

z_{k+1} = f(z_k, u_k) + w_k,   (21)

where z_k is the real state and w_k ∈ W is an additive, bounded, and unknown disturbance that can represent model uncertainty or measurement noise; W is a compact set containing the origin in its interior. In case the model dynamics are unavailable, the derivation of the nominal model (1) may resort to data-driven modeling approaches. For a specific data-driven modeling approach and the estimation of the associated uncertainty set W, please refer to [39]. At any time instant k, let x_{k+j|k} be the state predicted by applying the controls u(x_k), ..., u(x_{k+L−1}) to model (1). Assuming that the uncertainty set W is norm-bounded, i.e., ∥w_k∥ ≤ ε_w, the following lemma is stated.
Lemma 2 ([50]): The difference between the real state under u(z) and the nominal one under u(x) satisfies

∥z_{k+j} − x_{k+j|k}∥ ≤ ε_w Σ_{i=0}^{j−1} L_f^i,

i.e., z_{k+j} ∈ x_{k+j|k} ⊕ D_{ε_w}^j, where D_{ε_w}^j = {d ∈ R^n | ∥d∥ ≤ ε_w Σ_{i=0}^{j−1} L_f^i}. The proof is similar to [50]. □

Accordingly, the constraint on the nominal state is shrunk, i.e., X̃_{k+j} = X_{k+j} ⊖ D_{ε_w}^j. The barrier function on the state in (10) is modified according to the constraint x_{k+j|k} ∈ X̃_{k+j}. We assume that the computed X̃_{k+j} is non-empty and contains the origin in its interior for all k ≥ k̄.
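The tightening can be computed directly from the Lipschitz bound: the j-step deviation is bounded by ε_w times a geometric sum in L_f. This is our reading of Lemma 2; the exact definition of D^j_{ε_w} is in the paper:

```python
def tightening_radius(eps_w, L_f, j):
    """Bound on ||z_{k+j} - x_{k+j|k}||: the disturbance bound eps_w
    propagated through a Lipschitz model for j steps,
    eps_w * sum_{i=0}^{j-1} L_f^i (closed-form geometric sum)."""
    if abs(L_f - 1.0) < 1e-12:
        return eps_w * j
    return eps_w * (L_f**j - 1.0) / (L_f - 1.0)

def shrunk_circle_margin(d, eps_w, L_f, j):
    """Tighten a circular obstacle constraint ||x - c|| >= d to
    ||x - c|| >= d + tightening_radius, so the real state stays safe."""
    return d + tightening_radius(eps_w, L_f, j)
```

Because the radius grows geometrically in j when L_f > 1, long horizons L quickly become conservative, which is why the design choices at the end of this subsection aim to reduce L_f.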
Theorem 2 (Robustness): Under Assumptions 3-4, the state evolution of (21), obtained by applying the optimal policy u^* learned with (1), converges to the set D_{ε_w}^∞, i.e., lim_{k→+∞} x_k → D_{ε_w}^∞. Proof: Please refer to Appendix A. □

As suggested in [50], to reduce the size of D_{ε_w}^j, i.e., the effect of the Lipschitz constant L_f, two design choices are available: (i) a different suitable norm type can be used; (ii) an additional feedback term K(z_k − x_k) can be added to the control input to reduce the conservativeness of the multi-step prediction of (1), where K ∈ R^{m×n} is a stabilizing gain matrix for (1).

C. Convergence Analysis of BAC Learning Algorithm
Note that, as shown in Proposition 1, the control problem for (1) with J̄(x_k) is equivalent to an unconstrained problem for the time-varying model (11) with cost (12). The convergence analysis of the BAC learning algorithm in this scenario would be rather involved using the Lyapunov method, since the optimal weights of the actor and critic are time-dependent due to y_k = (x_k, B_k(x_k)). For simplicity, we recall that a time-varying constraint can be partitioned into several segments of time-invariant ones. Hence, in the following, we prove the convergence of the BAC learning algorithm under time-invariant state and control constraints, i.e., X = X_k and U = U_k. That is, we prove that, whenever the constraints change, our algorithm can eventually converge after some time steps. To this end, one first writes

J̄^*(x_k) = (W_c^*)^⊤ h_c(x_k) + κ_c(x_k),   ν^*(x_k) = (W_a^*)^⊤ h_a(x_k) + κ_a(x_k),

where W_c^* and W_a^* are constant weights, and κ_c and κ_a are reconstruction errors. In view of the universal approximation capability of neural networks with one hidden layer, we introduce the following assumption on the actor and critic networks.
Assumption 5 (Weights and reconstruction errors of BAC):

To state the following theorem in a compact form, we let W̃_⋆ = W_⋆^* − W_⋆, ⋆ = a, c in turn, denote ∆h̄_{c,k} = ∆h_{c,k}^⊤ ∆h_{c,k}, where ∆h_{c,k} = γ^L h_{c,k+L} − h_{c,k}, and use q and q^+ to denote q_k and q_{k+1}, respectively, unless otherwise specified. For simplicity, we assume that

Theorem 3 (Convergence of BAC learning): Under Assumptions 2 and 5, if the stated conditions hold, where d_m = 4γ_a(σ_{a,m}^2 + B_{v,m}^2 + B_{x,m}^2) and q_1, q_2 > 0, then it holds that ∥(ξ_{a,k}, W̃_{c,k})∥ ≤ ϵ_m/λ_min(S) as k → +∞, where ξ_{a,k} = W̃_{a,k}^⊤ h_a(x_k), ϵ_m is a bounded error, and S is a positive-definite matrix, whose definitions are deferred to Appendix A. Also, (ξ_{a,k}, W̃_{c,k}) → 0 as k → +∞ if κ_⋆(x_k) → 0, ⋆ = a, c in turn. Proof: Please refer to the Appendix. □

V. SIMULATION AND EXPERIMENTAL RESULTS
The developed theoretical results are first verified with two simulated robot examples. Please see Appendix C for the detailed implementation steps and results. In this section, we focus on the application of our approach to two real-world intelligent vehicles. Specifically, an integrated path-following and collision-avoidance problem is considered, which represents a crucial capability for navigation of intelligent vehicles in unknown and dynamic environments [51], [52].

A. Application to a Differential-Drive Vehicle: Offline Learning Scenario

Consider the integrated path-following and collision-avoidance problem of a differential-drive vehicle. Its kinematics model is

q̇ = [ṗ_x, ṗ_y, θ̇]^⊤ = [v_o cos θ, v_o sin θ, ω]^⊤,

where (p_x, p_y) is the coordinate of the vehicle in the Cartesian frame, θ is the yaw angle, and u = [v_o, ω]^⊤ is the control input, with v_o and ω the linear velocity and yaw rate, respectively. Let us define the path-following error as e = q_r − q, where q_r is the reference state. One can write the error model as (25), where (e_x, e_y, e_θ) =: e, and v_{o,r} and ω_r are the reference inputs.
In implementation, model (25) was discretized with a sampling interval ∆t = 0.05 s to derive a model of the form (1). The constraint for collision avoidance was formulated as ∥p_k − c_k∥ ≥ d, where p_k is the vehicle position and d and c_k are the radius and center of the obstacle, respectively. Also, the size of X_k was properly shrunk by increasing d to account for uncertainties. In the training, the penalty matrices were selected as Q = I and R = 0.1, with µ = 0.001. The discounting factor was γ = 0.95, and the relaxing factor was κ_b = 0.05. The basis functions σ_c(x) and σ_a(x) were chosen as hyperbolic tangent activation functions with N_c = N_u = 4. The step length was chosen as L = 10. The weights W_c and W_a were initialized with uniformly random numbers.
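The discretization and the obstacle constraint above can be sketched as follows; the helper names are ours, and the unicycle kinematics stand in for model (25) before the error transformation:

```python
import numpy as np

def step_kinematics(q, u, dt=0.05):
    """Forward-Euler step of the unicycle kinematics,
    q = (p_x, p_y, theta), u = (v_o, omega)."""
    px, py, th = q
    v, w = u
    return np.array([px + dt * v * np.cos(th),
                     py + dt * v * np.sin(th),
                     th + dt * w])

def violates_obstacle(q, c, d):
    """Collision-avoidance constraint: keep ||(p_x, p_y) - c|| >= d."""
    return np.hypot(q[0] - c[0], q[1] - c[1]) < d
```

In the offline learning setup, d is inflated beyond the physical obstacle radius so that the shrunken constraint absorbs model and measurement uncertainty.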
Simulation results using the Safety Gym environment [53]. We tested our approach in the Safety Gym environment with the MuJoCo simulator [54]. Our method was compared with several state-of-the-art safe RL algorithms: constrained policy optimization (CPO) [55], trust region policy optimization with Lagrangian methods (TRPO-L) [53], proximal policy optimization with Lagrangian methods (PPO-L) [53], deep deterministic policy gradient [56] with cost shaping (DDPG-CS), and soft actor-critic [3] with cost shaping (SAC-CS). Note that the safety-aware RL of [28] was not directly applicable in this case, since it is nontrivial to find an invertible barrier function for the obstacle constraint. In the training stage, all the parameter settings of CPO, TRPO-L, and PPO-L were consistent with those in [53]. In DDPG-CS and SAC-CS, we used the same cost function as ours, i.e., (10). We directly deployed the offline-learned control policy based on the kinematics model, since the vehicle's dynamic model was unknown. All the comparative algorithms were trained and deployed in the same Safety Gym environment. The simulation results in Table I show that our approach outperforms all the comparative algorithms in data efficiency, collision avoidance, and performance (see the video details²).
Remark 8: There are failures under our offline-learned policy without MPE for the following reasons. First, the obstacles' locations were generated randomly without considering the vehicle's physical limits. Second, the estimated uncertainty is inaccurate, since the inputs in Safety Gym are saturated values with unknown physical interpretations. These issues prevent the conditions of Theorem 2 from being verified. In our case, one can guarantee safety (possibly with some conservativeness) by activating our MPE mechanism even if the model is inaccurate; see Tables I and II. As shown in Table II, when the obstacles overlapped with the reference path between the target and the vehicle, our approach offers a significant performance improvement compared with the other adopted approaches². DDPG-CS and SAC-CS failed to obtain a converged policy after several training trials; therefore, their results are not shown in the tables. In summary, our approach outperforms the comparative model-free safe RL approaches for two reasons. First, the proposed barrier force-inspired control policy structure has a clear physical interpretation, which guarantees safety online and improves the generalization ability. Second, our approach is model-based, facilitating multi-step policy evaluation online.
Real-world experimental results with comparisons to nonlinear MPC algorithms. We also tested our proposed algorithm on a real-world differential-drive vehicle platform. The control task is to follow a predefined reference path (with v_{o,r} = 0.7 m/s) while passing and avoiding collision with a moving object (a vehicle) traveling along the same reference path. In such a situation, the conflict between the goals of path following and collision avoidance leads to a challenging multi-objective control problem.
In the experiment, the vehicle was equipped with a laptop running Ubuntu on an Intel i7-8550U CPU @ 1.80 GHz. The sampling interval was set as ∆t = 0.1 s. We directly deployed the offline-learned policy of our approach to control the vehicle. At each sampling instant, the onboard laptop computed the control input in real time using the state information, which was periodically measured by the onboard satellite-inertial navigation system. The following MPC algorithms were adopted for comparison.
1) A nonlinear MPC algorithm with nonconvex circular constraints (NMPC-c), where the constraint was designed according to [57]. The vehicle obstacle was approximated by a circle, i.e., we enforced the constraint (∆p_x)² + (∆p_y)² > d_o² in NMPC-c, where ∆p_x and ∆p_y are the deviations from the robot to the obstacle along the corresponding coordinate axes and d_o = 1 m.
2) A nonlinear MPC algorithm with nonconvex ellipsoidal constraints (NMPC-e), designed according to [58]. The vehicle obstacle was approximated by an ellipsoid whose semi-major axis was aligned with the direction of the reference path. The semi-major and semi-minor radii were computed as 1.517 m and 1.017 m, respectively, according to [58].

3) A nonlinear MPC algorithm with a control barrier function [59] (NMPC-cbf). The collision-avoidance constraint is formulated as a control barrier function constraint, where h is a control barrier function encoding the distance to the obstacle and the parameter η is a positive scalar that was properly tuned for fair comparison (see Table IV).
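The NMPC-cbf constraint can be checked pointwise with the standard discrete-time CBF decrease condition h(x_{k+1}) − h(x_k) ≥ −η h(x_k); whether [59] uses exactly this form is an assumption, and the candidate h below is a plain circular-distance function:

```python
def cbf_satisfied(h_next, h_curr, eta):
    """Discrete-time CBF decrease condition
    h(x_{k+1}) - h(x_k) >= -eta * h(x_k)  (standard form, assumed here):
    h may shrink, but only by at most an eta fraction per step."""
    return h_next - h_curr >= -eta * h_curr

def h_circle(px, py, ox, oy, d_o=1.0):
    """Candidate CBF for a circular obstacle at (ox, oy):
    positive iff the robot is outside radius d_o."""
    return (px - ox)**2 + (py - oy)**2 - d_o**2
```

Keeping this condition along the prediction horizon keeps the superlevel set {h ≥ 0} forward invariant, which is the safety mechanism NMPC-cbf relies on.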
The stage costs of all the comparative MPC algorithms were designed to be the same as in (4), and the prediction horizon was set to N_p = 20. According to [58], a potential function was additionally adopted to improve the collision avoidance performance in NMPC-c and NMPC-e; ϵ_p was chosen as 0.0001, and µ_p was tuned for fair comparisons (see Table IV). All the MPC algorithms were solved at each sampling interval with the CasADi toolbox [60] and the Ipopt solver [61]. All the algorithms were tested under different reference profiles. The experimental results under dynamic collision avoidance are reported in Table IV and Figs. 4-8 in Appendix B (see the video for details). The results show that NMPC-c and NMPC-e failed to overtake and instead followed behind the moving obstacle when the adopted reference points were dense, while our approach realized conflict resolution in all scenarios. Our approach also outperforms NMPC-c and NMPC-e in terms of planning performance and path following performance (see Table IV). In addition to the unique policy design and learning mechanism of our approach, the performance improvement over the MPC algorithms is also due to a significant reduction of the computational load (see Table III and Fig. 10 in Appendix B). To further show the effectiveness of our approach, we carried out extra tests in which the moving obstacle was manually manipulated to block the path of the ego vehicle; the ego vehicle reacted promptly and avoided collision successfully (see Fig. 9 in Appendix B).

B. Application to an Ackermann-Drive Vehicle: Online Learning Scenario
Consider the path following control of an Ackermann-drive vehicle with collision avoidance. Its simplified lateral dynamics is described by a "bicycle" model (cf. [62]), that is

Ẋ = v_x cos(φ) − v_y sin(φ),
Ẏ = v_x sin(φ) + v_y cos(φ),
φ̇ = ω,
v̇_y = −((C_f + C_r)/(m v_x)) v_y + ((l_r C_r − l_f C_f)/(m v_x) − v_x) ω + (C_f/m) δ,
ω̇ = ((l_r C_r − l_f C_f)/(I_z v_x)) v_y − ((l_f² C_f + l_r² C_r)/(I_z v_x)) ω + (l_f C_f/I_z) δ,

where X and Y are the coordinates of the vehicle center of mass in the Cartesian frame XoY, v_x and v_y are the longitudinal and lateral velocities, respectively, φ is the yaw angle, ω is the yaw rate, I_z = 4175 kg·m² is the yaw moment of inertia, m = 1723 kg is the mass of the vehicle, C_f = 66900 N and C_r = 62700 N are the cornering stiffness of the front and rear tires, respectively, l_f = 1.322 m and l_r = 1.468 m are the distances from the center of mass to the front and rear axles, and δ is the front wheel angle to be manipulated. Given path reference points (X^r, Y^r) and v_x, we aim to minimize the lateral distance from the vehicle center of mass to the nearest reference point while avoiding potential collisions with obstacles. To this end, let the nearest point be (X_p^r, Y_p^r); then one can compute the reference yaw angle φ_p^r. Define e_y = −(X − X_p^r) sin(φ_p^r) + (Y − Y_p^r) cos(φ_p^r) and e_φ = φ − φ_p^r. Let x = (e_y, ė_y, e_φ, ė_φ); then one can obtain the continuous-time lateral dynamical model ẋ = F_1(x) + F_2(x)δ + F_3(x)φ_p^r. Since (x, δ) = 0 might not be an equilibrium point if φ_p^r ≠ 0, we introduced a virtual control variable u = δ + δ_f, where δ_f was selected such that F_2(x)δ_f = F_3(x)φ_p^r. The lateral dynamical model was then discretized with a sampling interval ∆t = 0.02 s, i.e., x_{k+1} = x_k + ∆t F_1(x_k) + ∆t F_2(x_k) u_k. In the path following control task with collision avoidance, the cost function was chosen as in (4). Real-world experimental results. We also tested our safe RL algorithm on a real-world intelligent vehicle platform built with a HongQi EHS3 electric car to realize the path following control (see Fig. 3).
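As a sanity check of the lateral dynamics above, one explicit-Euler step of the linear "bicycle" model (in the lateral velocity v_y and yaw rate ω) can be sketched with the paper's vehicle parameters. The choice of state pair, the Euler scheme, and the constant v_x = 15 m/s are our assumptions for illustration:

```python
# Vehicle parameters from the paper; v_x and the Euler scheme are assumptions.
M, IZ = 1723.0, 4175.0        # mass [kg], yaw moment of inertia [kg m^2]
CF, CR = 66900.0, 62700.0     # front/rear cornering stiffness [N/rad]
LF, LR = 1.322, 1.468         # CoM-to-front/rear-axle distances [m]

def lateral_step(vy, omega, delta, vx=15.0, dt=0.02):
    """One explicit-Euler step of the linear 'bicycle' lateral dynamics."""
    vy_dot = (-(CF + CR) / (M * vx) * vy
              + ((LR * CR - LF * CF) / (M * vx) - vx) * omega
              + CF / M * delta)
    omega_dot = ((LR * CR - LF * CF) / (IZ * vx) * vy
                 - (LF**2 * CF + LR**2 * CR) / (IZ * vx) * omega
                 + LF * CF / IZ * delta)
    return vy + dt * vy_dot, omega + dt * omega_dot
```

From rest, a positive front wheel angle produces positive lateral velocity and yaw rate, as expected of a front-steered vehicle.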
In the experiment, the states of the vehicle were measured by a SIGIPS; the measured states were then transmitted to an industrial control computer, where the control policy was computed using our approach with a sampling interval of 0.02 s. We first applied our algorithm to follow a reference path with road boundaries (see Fig. 3-B). Different from the simulation tests, the vehicle speed was controlled by a PI controller to track a time-varying speed reference. This introduced a strong nonlinearity into the lateral dynamics, leading to extra difficulties in the control task. The experimental results displayed in Fig. 3 show that the control policy of our approach can be learned offline and deployed online safely, showing an impressive sim-to-real transfer capability. Also, one can achieve better control performance by learning the control policy online, which further demonstrates the adaptability of our approach to dynamic environments.
To show the capability of dealing with time-varying state constraints, we tested our approach in tracking a reference path that overlapped with obstacles (see Fig. 3-C). As before, the location information of the obstacles was assumed to be pre-detected. In this experiment, the control policy was learned and deployed synchronously online. The initial constraints were the road boundaries; the constraint on e_y was then changed accordingly once the vehicle was near the obstacle. Using our approach, the vehicle avoided collision successfully and converged rapidly to the reference path after completing the collision avoidance task (see again Fig. 3).
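The time-varying bound on e_y described above can be sketched as a simple distance-triggered schedule. The activation distance and corridor values here are hypothetical placeholders, not the values used in the experiment:

```python
def ey_bounds(dist_to_obstacle, road=(-1.75, 1.75),
              avoid=(0.4, 1.75), d_act=15.0):
    """Return the active (lower, upper) bound on the lateral error e_y.

    Road boundaries apply by default; once the vehicle is within d_act
    meters of a pre-detected obstacle, the corridor is shifted so that
    the vehicle must pass on one side of it. All numbers are hypothetical.
    """
    return avoid if dist_to_obstacle < d_act else road
```

A schedule of this kind is what makes the state constraint time-varying from the policy's point of view, and the multi-step policy evaluation can look ahead through such a schedule before committing to a policy update.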

C. Implementation Issues and Discussions
Implementation issues. First, the tuning parameter µ is suggested to be chosen smaller than the entries of Q and R to obtain a satisfactory control performance; a larger choice of µ might result in a safe but conservative control policy. Second, the initial values of W_a, σ, K, and ρ in the actor must be properly selected such that the initial control policy with (19) is L-step safe, which is a prior condition in Theorem 1. Finally, the relaxing factor κ_b in Lemma 1 must also be selected properly: a smaller choice is suggested if a less conservative control policy is expected, while a larger choice can be made to ensure absolute control safety.
Discussions. As a prominent feature, our approach can learn an explicit control policy offline and deploy it in a different control scenario even if the concerned constraints are nonlinear and nonconvex. In MPC, by contrast, the control action must be computed online by periodically solving an optimization problem [39], which can be difficult for on-the-fly implementation under nonlinear and nonconvex constraints; see Section V-A. As shown in the simulation and real-world experiments, our policy, learned with an inaccurate model, exhibits an impressive sim-to-real transfer capability compared with state-of-the-art model-free RL approaches. In the differential-drive vehicle experiments, our approach outperforms the comparative MPC algorithms under measurement noises and modeling uncertainties. Indeed, our approach is a step forward in applying safe RL to real-world intelligent vehicle control problems.

VI. CONCLUSIONS
This paper proposed a safe RL algorithm with a barrier-based control policy structure and a multi-step policy evaluation mechanism for the optimal control of discrete-time nonlinear systems with time-varying safety constraints. Under certain conditions, safety can be guaranteed by our approach in both online and offline learning cases, and our approach can solve continuous control tasks in environments with abrupt changes. The convergence and robustness of our safe RL algorithm under nominal and disturbed scenarios were proven, respectively, and the convergence condition of the barrier-based actor-critic learning algorithm was obtained.
Besides numerical simulations for theoretical verification, we tested our approach on two real-world intelligent vehicle platforms. The simulation and real-world experimental results illustrate that our method outperforms state-of-the-art safe RL approaches in control safety, and shows an impressive sim-to-real transfer capability and a satisfactory real-world online learning performance. In general, the proposed safe RL algorithm is a step forward in applying safe RL to the optimal control of real-world nonlinear physical systems with time-varying safety constraints. Future works will consider the extension to model-free safe RL with theoretical guarantees.
(i) First note that the claim holds for the base case. Then one can obtain the result by induction.
Moreover, J̄_i(x_k) is a positive semidefinite function in view of the property of the barrier function. Then one can conclude that J̄_i(x_k) converges to a value denoted as J̄_∞(x_k) ≥ 0. From Claim 1), one can promptly conclude that J̄_∞(x_k) = J̄*(x_k), and that v_∞, ρ_∞, and K_∞ equal the optimal values v*(x_k), ρ*, and K*, respectively. □
2) Proof of Proposition 2: In view of (8) and (3), one can observe that u_k = 0 for k ≥ k̄ provided that v_k = 0 and x_k = 0. Also, B_k(u_k) = 0 since u_k = 0, and s_k = 0 since x_k = 0, for k ≥ k̄. Hence, J*(x_k) < +∞ under u*(x_k) with v*(x_k), ρ*, K*. As a result, y_k, u_k → 0 as k → +∞ under the policy u*(x_k); consequently, x_k → 0 as k → +∞.
□
3) Proof of Theorem 2: (i) Offline learning scenario. In view of the Lipschitz continuity condition (2), the difference between the real state z under u*(z) and the nominal state x under u*(x) satisfies ‖z_1 − x_{1|0}‖ = ‖w_0‖ ≤ ϵ_w, since x_{0|0} = z_0. Then, by induction, one can bound ‖z_k − x_{k|0}‖ for all k. Hence, the real state z_k converges to D^∞_{ϵ_w} as k → +∞. (ii) Online learning scenario. Given an offline learned control policy, at any time instant k the policy is updated to obtain an improved control performance under the constraint x_{k+j|k} ∈ X̄_{k+j}. In this case, the control policy cannot be updated if the learned one is evaluated (by MPE) to be inferior to the current one. Given the proof in the offline learning case, the real state under the online learned policy converges to the set D^∞_{ϵ_w}. □
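The induction step in part (i) accumulates the disturbance bound geometrically. Assuming L_F denotes the Lipschitz constant of the closed loop in condition (2) (our notation, for illustration), the worst-case deviation after k steps can be sketched as:

```python
def deviation_bound(eps_w, L_F, k):
    # ||z_k - x_{k|0}|| <= eps_w * sum_{i=0}^{k-1} L_F**i, obtained by
    # unrolling ||z_{j+1} - x_{j+1|0}|| <= L_F*||z_j - x_{j|0}|| + eps_w
    # from the base case ||z_1 - x_{1|0}|| <= eps_w.
    return eps_w * sum(L_F**i for i in range(k))
```

For L_F < 1 the bound converges to eps_w / (1 − L_F) as k grows, which is consistent with the real state converging to a disturbance-dependent set D^∞_{ϵ_w}.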

4) Proof of Theorem 3: Define a Lyapunov function V_{⋆,k} (⋆ = a, c in turn) with parameter α_c > 1. In view of the update rules (18) and (20), one can write the difference ΔV_{⋆,k}. In computing ΔV_c, the term Δκ_{c,k} = γ^L κ_{c,k+L} − κ_{c,k} appears, and the second equality in (31) is due to the Bellman equation (14).
Since ϵ_t is bounded, let ϵ_m be an upper bound of ϵ_t; then, in view of Assumption 5 with S ≻ 0, it follows that ΔV ≤ 0 for all ‖(ξ_{a,k}, W̃_{c,k})‖ ≥ ϵ_m/λ_min(S). Consequently, ‖(ξ_{a,k}, W̃_{c,k})‖ → 0 as k → +∞, provided that ϵ_t → 0. □
Fig. 4. Experimental results on path following and collision avoidance by NMPC-c (µ_p = 5·10⁻³): the shorter line with a gray shade represents the route of the moving vehicle, while the longer line represents the trajectory of the ego vehicle. With dense reference points (d_r = 0.07 m), the ego vehicle was unable to pass, but it succeeded with sparse reference points; in the latter scenario, the collision avoidance process caused the ego vehicle to experience a short transient period of rapid speed variation.
Fig. 5. Experimental results on path following and collision avoidance by NMPC-e (µ_p = 5·10⁻⁴); legend and observations as in Fig. 4.
Fig. 6. Experimental results on path following and collision avoidance by NMPC-cbf (η = 0.5); legend and observations as in Fig. 4.
Fig. 7. Experimental results on path following and collision avoidance by NMPC-cbf (η = 1); legend and observations as in Fig. 4.
Fig. 10. CPU running time comparison in C++. At many time instants, the computational times of NMPC-c, NMPC-e, and NMPC-cbf exceed the adopted sampling interval of 0.1 s, which could hamper the control performance (see Table IV), while the computational time of our approach is much smaller and its influence on the control performance is negligible.

B. Auxiliary Real-World Experimental Results on the Differential-Drive Vehicle
Figs. 4-8 present the experimental results of NMPC-c, NMPC-e, NMPC-cbf, and our approach on path following with dynamic collision avoidance. Fig. 9 presents the experimental results of our approach in a noncooperative dynamic collision avoidance scenario, and Fig. 10 shows the significant advantage of our approach in computational load reduction in the C++ environment.

C. Simulation on Regulation of a Mass-Point Robot
Consider the regulation control of a mass-point robot. Its discrete-time model is described by x_{k+1} = Ax_k + Bu_k. In the simulation, the penalty matrices were selected as Q = I, R = 0.1, and µ = 0.001; the discounting factor was γ = 0.95; the step L was chosen as L = 10. The entries of the weighting matrices W_c and W_a were initialized with uniformly random numbers. The initial state was x = (−0.5, −0.5). The state and control were initially limited by −1 ≤ x_1, x_2 ≤ 0.5 and −1 ≤ u ≤ 0.3. Then, at k = 285, x was reset to (−0.65, −0.65) and the constraints were changed to −0.5 ≤ x_1, x_2 ≤ 0.3 and −0.5 ≤ u ≤ 0.1.
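The time-varying constraint setup above can be sketched as a schedule plus a saturation step. The system matrices A and B are not reproduced here, so this sketch only covers the constraint switch at k = 285 and a crude clipping stand-in for the barrier-based policy (illustration only, not the paper's method):

```python
def constraints(k):
    # State and control boxes from the simulation; tightened at k = 285.
    if k < 285:
        return (-1.0, 0.5), (-1.0, 0.3)   # (x1/x2 bounds, u bounds)
    return (-0.5, 0.3), (-0.5, 0.1)

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def apply_policy(k, u_raw):
    # Saturate a raw control to the currently active box. The actual
    # barrier-based policy shapes u smoothly instead of hard-clipping.
    _, (u_lo, u_hi) = constraints(k)
    return clip(u_raw, u_lo, u_hi)
```

Hard clipping like this satisfies the control box but ignores the state constraints, which is precisely why the paper's barrier force-based policy and multi-step evaluation are needed.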
The control performances are compared in Fig. 11, which shows that all approaches converge to the origin; however, only our approach and safety-aware RL [28] ensure safety constraint satisfaction during the control process. Moreover, safety comparisons over 500 repeated experimental tests are listed in Table V, which illustrates that our approach ensured safety in all the performed tests, whereas safety-aware RL [28] did not under time-varying safety constraints; the reason is the multi-step policy evaluation mechanism adopted in our approach. From the box plot in the bottom-right panel of Fig. 11, one can see that the mean values of J are comparable across all approaches, while the standard deviation of the cost in our approach is smaller than those in [28], [44], and [63]. The results reveal that the performance of our approach is more stable under state and control constraints, owing to the proposed barrier-based control policy design.

D. Simulation on Regulation of Van Der Pol Oscillator
Consider the regulation control of the Van der Pol oscillator [39]. Its discrete-time model is given as

x_{1,k+1} = x_{1,k} + ∆t x_{2,k},
x_{2,k+1} = x_{2,k} + ∆t (x_{2,k} − x_{1,k}² x_{2,k} − x_{1,k} + u_k),

where x_1 and x_2 are the states, u is the control variable, and ∆t = 0.01 s. In the simulation, the penalty matrices were selected as Q = I, R = 0.1, and µ = 0.01; the discounting factor was γ = 0.95; the step L was chosen as L = 10. The entries of the weighting matrices W_c and W_a were initialized using saturated uniformly random numbers such that Theorem 1 was fulfilled. Starting from the initial condition x_0 = (−0.5, −0.5), training was performed under time-varying state constraints; see Fig. 12. We compared our approach with two classic control methods, i.e., heuristic dynamic programming (HDP) [63] and multi-step heuristic dynamic programming (MsHDP) [44], and with three safe RL approaches, i.e., safety-aware RL [28], DDPG-CS, and SAC-CS. The control parameters of the proposed safe RL algorithm and of the comparative approaches in [28], [44], and [63] were set similarly. In DDPG-CS and SAC-CS, the cost function was reshaped with the same barrier functions adopted in this paper, and all the training parameters were fine-tuned according to [56] and [3], respectively. The simulation results in Fig. 12 and Table VI show that our approach can cope with time-varying state constraints in the control (learning) process, while safety-aware RL [28], DDPG-CS, and SAC-CS could fail due to the sudden change of constraints. The reason is that these safe RL approaches cannot predict future changes of safety constraints or inform the actor-critic structure how to achieve safety, and hence they are prone to failing in abruptly changing environments. MsHDP [44] and HDP [63] cannot guarantee safety constraint satisfaction. Moreover, our approach converged faster than those in [28], [44], and [63] (see Fig. 12).
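The discretized oscillator above is straightforward to reproduce; a minimal sketch of one step (with the control u supplied externally, e.g., by the learned policy) is:

```python
def vdp_step(x1, x2, u, dt=0.01):
    # Explicit-Euler discretization of the Van der Pol oscillator used
    # in the simulation: x1' = x2, x2' = x2 - x1^2*x2 - x1 + u.
    x1_next = x1 + dt * x2
    x2_next = x2 + dt * (x2 - x1**2 * x2 - x1 + u)
    return x1_next, x2_next
```

Note the positive sign of the x2 term: the open-loop system is unstable near the origin (anti-damped for small |x1|), so u = 0 does not regulate the state and a stabilizing policy is genuinely required.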

Fig. 11. Simulation results for the control of the mass-point robot: the state, control, and cost comparisons between our approach and the algorithms in [28], [44], and [63].
Fig. 12. Simulation results for the control of the Van der Pol oscillator: the state variables (left panel) and the stage cost (right panel) with our approach and the adopted comparative algorithms. In each training of safety-aware RL [28], DDPG-CS, and SAC-CS, the control (learning) safety was not fulfilled, and the learning process was terminated when the size of the state constraint was suddenly reduced. In contrast, our approach can adapt to the constraint variation thanks to the adopted control policy structure and multi-step policy evaluation mechanism.