Fear-Neuro-Inspired Reinforcement Learning for Safe Autonomous Driving

Ensuring safety and achieving human-level driving performance remain challenges for autonomous vehicles, especially in safety-critical situations. As a key component of artificial intelligence, reinforcement learning is promising and has shown great potential in many complex tasks; however, its lack of safety guarantees limits its real-world applicability. Hence, further advancing reinforcement learning, especially from the safety perspective, is of great importance for autonomous driving. As revealed by cognitive neuroscientists, the amygdala of the brain can elicit defensive responses against threats or hazards, which is crucial for survival in and adaptation to risky environments. Drawing inspiration from this scientific discovery, we present a fear-neuro-inspired reinforcement learning framework that realizes safe autonomous driving by modeling the amygdala functionality. This technique enables an agent to learn defensive behaviors and make safe decisions with fewer safety violations. Through experimental tests, we show that the proposed approach enables the autonomous driving agent to attain state-of-the-art performance compared to the baseline agents and to perform comparably to 30 certified human drivers, across various safety-critical scenarios. The results demonstrate the feasibility and effectiveness of our framework while also shedding light on the crucial role of simulating the amygdala function in applying reinforcement learning to safety-critical autonomous driving domains.


I. INTRODUCTION
AUTONOMOUS driving has attracted considerable attention from both academia and industry across the globe in recent years. The societal benefits of this paradigm are expected to include safer transportation, reduced congestion, and lower emissions. However, the safety aspect of autonomous driving remains a major concern for large-scale deployment. Many real-world scenarios contain inevitable nonstationarity and uncertainty, which may lead autonomous vehicles to exhibit undesirable and unsafe driving behaviors and might even cause fatal accidents. Given these potential risks, there is still a long way to go to meet the strict requirements and high expectations regarding the deployment of autonomous driving in society.
Modern artificial intelligence (AI) technologies have achieved numerous milestones [1], [2], [3], [4], exerting a strong impetus on the advancement of autonomous driving [5], [6]. Notably, reinforcement learning (RL) has emerged as a prominent field within AI, demonstrating remarkable achievements across various challenging decision tasks, such as Go [7], StarCraft [8], and autonomous racing [9]. Consequently, researchers have explored various RL algorithms and their applications in autonomous driving [10]. Although existing approaches have achieved many compelling results, the lack of safety guarantees limits the applicability of RL in safety-critical autonomous driving domains. In light of this concern, many researchers have studied safe RL methods for ensuring the safety of autonomous vehicles. A common paradigm is to combine traditional RL algorithms with safety checkers [11] or constraints [12] to optimize driving policies while guaranteeing or encouraging safety. Yet it is inevitable that the agent will encounter numerous hazardous situations before it can effectively learn to avoid safety violations, even with the integration of sophisticated techniques that minimize the likelihood of failures.
Recently, some researchers have advocated for increased research efforts in "NeuroAI," since it holds the potential to catalyze the advancement of next-generation AI technologies [13]. RL theory is derived from neuroscientific and psychological perspectives on organism behavior [14]. A common assumption regarding RL from the brain-science perspective is that the dopamine neurons in the midbrain code for reward prediction errors, which enable the striatum to learn rewarding behaviors [15]. Most existing computational RL frameworks can be represented with this mechanism [16]. However, in recent years, many neuroscientists have argued that the amygdala plays a central role in the RL function of the brain, perhaps a more important role than the striatum, but certainly a more important role than is attributed to it in current RL frameworks [15], [16]. The amygdala fear circuit in the brain can predict dangers and elicit defensive behavioral responses against threats and harms; this is crucial for survival in and adaptation to potentially risky environments [17]. Amygdala lesions inhibit the fear learning and avoidance behavior elicited by threats. Moreover, some studies in neuroscience and psychology have highlighted the necessity of actively forecasting hazards or contingencies via world models to ensure the survival of organisms [17].
Consequently, motivated by the aforementioned insights, in this work we establish linkages between AI, neuroscience, and psychology and explore a novel RL framework that models the amygdala functionality of the brain to further advance safe decision making for autonomous vehicles. More specifically, building upon the current computational framework for the dopamine-striatum mechanism, we present a fear-neuro-inspired RL (FNI-RL) technique that models the process of RL in the brain by considering the amygdala functionality, enabling the autonomous driving agent to learn defensive behaviors effectively. We encourage the agent to undertake risky explorations within its own imagination through a model-based setting, while executing safe decisions during interactions with the real environment to the greatest extent possible.
An overview of the proposed approach is illustrated in Fig. 1. In light of the RL-related functional systems in the brain, we first present an adversarial imagination mechanism to simulate safety-critical situations with a learnable adversary and world model, helping the agent cope with unseen hazardous scenarios and enhancing policy robustness against uncertainties and nonstationarities. Concretely, we leverage a mixed policy comprising both the agent and the adversary to interact with the learned world model, where the agent seeks to keep its fear within specified bounds while the adversary aims to maximize the agent's fear. Here, a fear model is constructed to estimate the fear of the agent in response to the recognition of dangers or contingencies. Based on findings in neuroscience [17], [18], our fear model incorporates both negative stimuli (e.g., safety violations) and environmental uncertainties. Additionally, we develop a fear-constrained actor-critic (FC-AC) algorithm that enables the agent to learn defensive driving behaviors and ensure safe decision making by effectively assessing unsafe policy trajectories and adhering to the imposed fear constraints.
Compared with existing studies, the main contributions of this work are summarized as follows. (1) Drawing inspiration from the fear neurons in the brain, we present a computational FNI-RL framework to enhance the safety of autonomous vehicles. (2) An adversarial imagination technique is proposed to simulate safety-critical situations, which helps the agent tackle unseen risky scenarios and improves policy robustness against uncertainties and nonstationarities. Here, a fear model is devised to recognize and estimate dangers and contingencies. (3) An FC-AC algorithm is developed to enable the agent to learn defensive driving behaviors and realize safe decision making with fewer safety violations.
We demonstrate the feasibility and effectiveness of the proposed FNI-RL approach for safe autonomous driving in comparison with state-of-the-art AI agents and 30 certified human drivers. The simulation tests are performed with the Simulation of Urban Mobility (SUMO) package [19]. In addition, experimental evaluations are carried out in three critical situations on a human-in-the-loop test platform (Fig. 4(b)) with a high-fidelity driving simulator, Car Learning to Act (CARLA) [20]. The results indicate that, enhanced by the developed FNI-RL algorithm, the autonomous driving agent can generate defensive decision-making behaviors, thereby significantly improving safety and matching the performance of human drivers in various safety-critical scenarios.
II. RELATED WORK

A. Safety-Critical Scenario Generation

In [27], a scheme called AdvSim is presented for generating safety-critical scenarios. AdvSim jointly optimizes vehicle trajectories to perturb the driving paths of surrounding vehicles. Moreover, incorporating AdvSim-generated safety-critical scenarios in training can benefit the safety of autonomous vehicles. In [28], a technique named STRIVE is introduced, which utilizes a graph-based conditional variational autoencoder (CVAE) model to automatically generate challenging scenarios. Here, the scenarios generated by STRIVE can be employed to optimize the hyperparameters of a rule-based planner. In [29], a gradient-based scenario generation method called KING is proposed, which utilizes a kinematic motion model to guide the generation of adversarial scenarios. Additionally, the safety of autonomous driving can be enhanced by augmenting the training data with the scenarios generated by KING. However, these methods rely on pre-collected datasets to learn traffic priors. Furthermore, they do not optimize driving policies by integrating generated safety-critical scenarios with RL. In [30], a causal generative model is devised to generate safety-critical scenarios through causal graphs derived from human priors. The authors also empirically demonstrate that incorporating the generated scenarios as additional training samples can enhance the performance of RL-based driving policies. Nevertheless, this technique depends heavily on human priors. In contrast, our FNI-RL approach for learning safe autonomous driving policies does not rely on any pre-collected datasets or human priors. In addition, unlike the aforementioned methods, FNI-RL optimizes both the driving policy and the adversarial sample generation module simultaneously in an online learning manner, as the RL agent interacts with the real environment.
An imitation learning (IL) technique with on-policy RL supervision is developed to enhance the performance of autonomous vehicles in [31]. A human-in-the-loop learning scheme called human-AI copilot optimization is proposed to facilitate the learning of safe driving policies in [32]. This approach integrates interventions from human experts into the interaction between the agent and the environment to guarantee both efficient and safe exploration. Furthermore, some researchers have employed RL methods with safety constraints based on prior knowledge [33] or rules [34] to optimize driving policies while guaranteeing the satisfaction of the imposed constraints. In [35], the authors present a constrained adversarial RL algorithm that aims to realize safe autonomous driving from the perspective of robust decision making. While these approaches can effectively improve the safety of autonomous vehicles, they either heavily rely on pre-collected datasets or human priors, or they must go through a substantial number of safety violations to learn safe driving policies. In contrast, the proposed FNI-RL approach allows the agent to acquire safe driving skills with fewer safety violations, without requiring pre-collected datasets or human priors.

B. Safe Model-Free Reinforcement Learning
A popular class of safe model-free RL (SMFRL) methods is dedicated to solving the constrained Markov decision process (CMDP) to ensure the acquisition of safe policies [36]. These studies extensively combine model-free RL frameworks with Lagrangian methods to restrict the cost value of the policy below a predetermined threshold [37]. In such methods, the policies and Lagrange multipliers are optimized iteratively via duality theory [38]. There are also SMFRL algorithms that incorporate reachability analysis [39], [40] or expert information [41], [42]. For instance, in [41], an SMFRL framework with prior knowledge is developed to ensure safe exploration. Although the above methods have achieved many competitive results, they either suffer from a large number of unsafe interactions during training or depend heavily on human priors. In contrast, FNI-RL does not require any prior knowledge and enables the agent to learn safe driving skills with fewer safety violations.

C. Safe Model-Based Reinforcement Learning
In safe model-based RL (SMBRL), apart from a policy model, an additional environment model must be learned, which can be leveraged to generate possible trajectories or evaluate the safety of actions before executing them in the real environment [43], [44], [45]. By incorporating cost constraints throughout the learning process, SMBRL methods have the potential to prevent dangerous exploration behaviors while ensuring sample efficiency [46], [47], [48]. For example, in [45], an SMBRL scheme is proposed to minimize safety violations during training. This method learns an ensemble of probabilistic dynamics models to plan a short horizon into the future and applies heavy penalties to unsafe trajectories. In [47], an SMBRL technique is introduced to cope with safety-critical tasks; it adopts a learned Bayesian world model to generate trajectories and estimate an optimistic bound for the task objective and pessimistic bounds for the constraints. Then, the augmented Lagrangian approach is employed to solve the constrained optimization problem with the estimated bounds. In [48], an SMBRL algorithm is developed with a Lagrangian relaxation-based proximal policy optimization technique and an ensemble of environment models. In this framework, both epistemic and aleatoric uncertainties are simultaneously taken into account during the learning of the dynamics models. Unlike the methods mentioned above, drawing inspiration from the fear neurons in the brain, FNI-RL incorporates the adversarial imagination technique that can simulate safety-critical situations
via the learned adversary and world model, assisting the agent in handling unseen risky scenarios and enhancing policy robustness against uncertainties and nonstationarities. Additionally, in FNI-RL, the agent is required to comply with the fear constraint that encompasses the dangers and uncertainties estimated by the adversarial imagination.

III. METHODOLOGY
The proposed FNI-RL framework for the safe decision making of autonomous vehicles mainly comprises the adversarial imagination technique and the FC-AC algorithm, as illustrated in Fig. 1.

A. Adversarial Imagination
We develop the adversarial imagination technique by combining the adversarial agent with the world model to simulate worst-case situations in imagination, enabling our autonomous driving agent to tackle unseen critical scenarios and improve policy robustness. Here, a mixed policy π_mix(·) is defined as

π_mix(·|s) = α·π_θ(·|s) + (1 − α)·π̄_θ̄(·|s), (1)

where α is a weight between 0 and 1, π(·) and π̄(·) represent the stochastic policies of the protagonist and the adversary, θ and θ̄ are the parameters of the policy network and the adversarial policy network, and s denotes the state of the agent, respectively. An action perturbed by the adversary, denoted as ã, can be sampled from the mixed policy, i.e., ã ∼ π_mix(·|s). The protagonist endeavors to optimize the expected return while ensuring that its fear remains within predefined bounds. Conversely, the adversary aims to maximize the protagonist's fear.
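As a concrete illustration of sampling from the α-weighted mixture above, the sketch below draws from the protagonist with probability α and from the adversary otherwise. It is a minimal, self-contained example: the Gaussian stand-in policies, action bounds, and variable names are our own assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixed_action(protagonist, adversary, state, alpha):
    """Sample a~ from pi_mix(.|s) = alpha * pi(.|s) + (1 - alpha) * pi_bar(.|s):
    pick the protagonist's policy with probability alpha, else the adversary's."""
    if rng.random() < alpha:
        return protagonist(state)
    return adversary(state)

# Hypothetical stand-in policies emitting clipped longitudinal accelerations.
protagonist = lambda s: float(np.clip(rng.normal(1.0, 0.1), -3.0, 3.0))   # mild throttle
adversary = lambda s: float(np.clip(rng.normal(-2.5, 0.1), -3.0, 3.0))    # harsh braking

state = np.zeros(26)  # placeholder for the paper's 26-dimensional state
actions = [sample_mixed_action(protagonist, adversary, state, alpha=0.8)
           for _ in range(1000)]
frac_protagonist = float(np.mean([a > 0 for a in actions]))
```

With α = 0.8, roughly 80% of the sampled actions originate from the protagonist, so the adversary only occasionally perturbs the rollout.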
In organisms, fear can be elicited by certain negative stimuli [17]. For instance, watching or experiencing a frightening traumatic accident can arouse fear in humans. In RL, the reward function serves as an incentive used to evaluate the behaviors of the agent. Similarly, in constrained RL [36], we can view the cost function as a form of negative stimulus, such as collisions. Furthermore, fear can also be caused by uncertainties [49], [50]. For example, a human being may feel fear in an uncertain environment. Consequently, we construct the fear model to incorporate both the anticipated negative stimuli and epistemic uncertainties simultaneously, which can be expressed as

F(·) = β·ĉ(·) + (1 − β)·σ(·), (2)

where β represents a weight that ranges from 0 to 1, and ĉ(·) and σ(·) denote the cost function and epistemic uncertainty estimated via the world model, respectively. From (2), a higher estimated cost and uncertainty arouse a more intense fear in the agent. f and f̄ denote the lower and upper bounds of the fear, respectively. In our setting, we utilize the probability of safety violations as the cost function, i.e., ĉ(·) ∈ [0, 1]. Moreover, the minimum of σ(·) is equal to zero, and we constrain the maximum of σ(·) to 1. Consequently, since (2) is a convex combination of quantities in [0, 1], we can conclude that F(·) ∈ [0, 1]. The world model aims to provide an internal representation of the contingencies of the real environment. Here, we leverage an ensemble of diagonal Gaussian world models to effectively capture both aleatoric and epistemic uncertainties [45], [51]. This ensemble can be denoted as {T̂_φk}, k = 1, ..., K, where T̂_φk(s′, c | s, a) = N(μ_φk(s, a), σ²_φk(s, a)), and s′ and K are the next state and the number of world models, respectively. Moreover, μ_φk(·) and σ_φk(·) represent the mean and standard deviation of the Gaussian distribution N(·) parameterized by φk. In contrast to the majority of existing environment models, our world model predicts a cost c rather than a reward r. The kth world model can be
trained by minimizing the following objective function based on the negative log-likelihood:

L(φk) = E_{(s, ã, s′, c) ∼ M}[−log T̂_φk(s′, c | s, ã)], (3)

where M denotes an experience replay memory. Random differences in initialization and the mini-batch paradigm during training give rise to distinct models, so the ensemble can be employed to produce predictions that incorporate uncertainties. By combining the ensemble with the mixed policy, set-valued estimates of the cost and uncertainty can be obtained, where ŝ and ŝ′ represent the state and next state estimated by the world model, respectively. With a short prediction horizon m, the fear of the agent can then be evaluated at the imagined state-action pair (ŝ_m, ã_m), where ŝ_m and ã_m represent the state and action obtained after m steps of forward planning based on the world model and mixed policy, respectively. We collect the generated virtual transitions into a virtual experience replay memory to enhance the performance of the agent. Additionally, the adversary model can be learned by maximizing an objective corresponding to the protagonist's fear over these imagined trajectories.
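To make the ingredients above concrete, the sketch below computes the per-model Gaussian negative log-likelihood and a fear signal that blends the ensemble's mean predicted cost with its disagreement, a stand-in for epistemic uncertainty. The disagreement scaling and all names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def gaussian_nll(mu, sigma, target):
    """Per-sample negative log-likelihood of `target` under N(mu, sigma^2),
    the quantity each world model minimizes over the replay memory."""
    return 0.5 * np.log(2 * np.pi * sigma ** 2) + (target - mu) ** 2 / (2 * sigma ** 2)

def fear(cost_preds, beta):
    """Fear as a convex combination of the ensemble's mean predicted cost
    (negative stimulus) and its scaled disagreement (epistemic uncertainty),
    each kept within [0, 1]. The 2x disagreement scaling is an assumption."""
    c_hat = float(np.mean(cost_preds))                          # cost in [0, 1]
    sigma = float(np.clip(2.0 * np.std(cost_preds), 0.0, 1.0))  # uncertainty in [0, 1]
    return beta * c_hat + (1.0 - beta) * sigma

# Collision probabilities predicted by a hypothetical ensemble of K = 5 models.
agree_safe = [0.05, 0.06, 0.04, 0.05, 0.06]  # models agree: low cost, low uncertainty
disagree = [0.05, 0.90, 0.10, 0.70, 0.20]    # models disagree: high uncertainty
f_safe = fear(agree_safe, beta=0.6)
f_risky = fear(disagree, beta=0.6)
```

Because the fear is a convex combination of two quantities in [0, 1], it stays in [0, 1], matching the bound argued in the text.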

B. Fear-Constrained Actor-Critic
In this section, the proposed FC-AC algorithm is introduced to optimize the driving policies of our agent while keeping its fear within preset bounds.
A CMDP augments a Markov decision process (MDP) with a cost function and can be represented by a 6-tuple ⟨S, A, p, r, c, γ⟩, where S is the state space, A is the action space, p is the transition probability distribution, r : S × A → R denotes the reward function, c : S × A → R represents the cost function, and γ ∈ (0, 1) is the discount factor.
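The 6-tuple maps directly onto a small container type; the two-state transition, reward, and cost functions below are purely hypothetical and serve only to show the structure.

```python
from typing import Callable, NamedTuple

class CMDP(NamedTuple):
    """The 6-tuple <S, A, p, r, c, gamma> of a constrained MDP; in the
    paper's setting, c returns the probability of a safety violation."""
    states: list
    actions: list
    p: Callable   # transition distribution p(s' | s, a)
    r: Callable   # reward function r: S x A -> R
    c: Callable   # cost function   c: S x A -> [0, 1]
    gamma: float  # discount factor in (0, 1)

# Toy two-state instance: swerving avoids nothing here, it just adds risk.
cmdp = CMDP(
    states=["safe", "crash"],
    actions=["keep", "swerve"],
    p=lambda s, a: {"safe": 0.95, "crash": 0.05} if a == "keep"
                   else {"safe": 0.60, "crash": 0.40},
    r=lambda s, a: 1.0 if s == "safe" else -1.0,
    c=lambda s, a: 0.05 if a == "keep" else 0.40,
    gamma=0.99,
)
```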
Under this CMDP formulation, FC-AC seeks to solve the following constrained optimization problem:

max_π E[Σ_t γ^t · r(s_t, a_t)]  s.t.  F ≤ f_0, (7)

where t is the time step and f_0 is a prescribed threshold on the agent's fear.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
A policy iteration algorithm, named fear-constrained policy iteration (FC-PI), is developed to approximate the optimal policies. The FC-PI method comprises two learning processes, policy evaluation and policy improvement, which are updated alternately until the policy converges. FC-PI provably converges to the optimal policy (see the supplementary material). Moreover, the Lagrangian of the constrained optimization problem can be written as

L(π, λ) = E[Σ_t γ^t · r(s_t, a_t)] − λ·(F − f_0), (8)

where λ ≥ 0 denotes the dual variable.
1) Fear-Constrained Policy Evaluation: The action-value function Q^π(s, a) can be iteratively computed under the fixed policies of the agent via a Bellman backup operator T:

T^π Q(s, a) = r(s, a) + γ·E_{s′∼p}[V^π(s′)], (9)

where V^π(·) denotes a value function given by the expected action value under the policy, V^π(s) = E_{a∼π(·|s)}[Q^π(s, a)]. The FC-AC algorithm employs two parameterized action-value functions with network parameters φ_z, z ∈ {1, 2}, to speed up the model training process [52]. The parameters of the action-value functions can be learned by minimizing the following loss function of the critic network:

J_Q(φ_z) = E_{(s, a, r, s′) ∼ M}[(Q_{φ_z}(s, a) − y)²], (11)

where y denotes a target value. According to the results in [53] and our empirical findings, training the action-value function network requires relatively high data quality. Therefore, we employ only real interaction data to train the action-value function network, reducing the reliance on the accuracy of the world model.
To ensure safety, it is imperative to guarantee that the Q-values of actions causing unsafe states are lower than the Q-values of safe actions.We follow the assumption regarding the existence of a special horizon H in [45].According to this assumption, after the agent completes H steps of safe interaction with the environment, it will inevitably transition into an unsafe state (i.e., with a safety violation).Then, the agent can no longer recover to the safe state (i.e., without a safety violation).
In theory, we can devise a specific cost c* as a penalty for safety violations so as to preclude the hazardous situation described in the above assumption. Under this assumption, the maximum of the infinite-horizon discounted return that accounts for the agent's fear is obtained by collecting at most H steps of reward bounded by the upper bound of the reward r before incurring the discounted penalty bounded below by the lower bound of the cost c*. In contrast, in the absence of any safety violations, the minimum of the infinite-horizon discounted return considering the fear is determined by the lower bound of the reward r.
To ensure a reasonable evaluation of the safety of decisions, it is desirable for the former bound to remain below the latter. Since the fear bounds are finite, this inequality can be satisfied by designing the cost c* to be sufficiently large that any trajectory containing a safety violation is valued below every violation-free trajectory. To prevent overestimation of the action-value function, the minimum estimate between the two target parameterized action-value functions is leveraged to train the critic network. Hence, y can be devised as

y = r + γ·min_{z∈{1,2}} Q_{φ̄_z}(s′, a′),  a′ ∼ π(·|s′), (16)

where Q_{φ̄_z} denotes a target action-value function. The network parameters φ̄_z of the target action-value functions are updated by Polyak averaging, φ̄_z ← ρ·φ̄_z + (1 − ρ)·φ_z, where ρ is a scale coefficient between 0 and 1.
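The clipped double-Q target and the Polyak averaging step described above can be sketched as follows; the critic stand-ins and numbers are illustrative only.

```python
import numpy as np

def td_target(r, s_next, a_next, q_targets, gamma):
    """Clipped double-Q target: y = r + gamma * min_z Q_z(s', a'), taking the
    minimum over the target critics to curb value overestimation."""
    return r + gamma * min(q(s_next, a_next) for q in q_targets)

def polyak_update(target_params, params, rho):
    """Soft target-network update: phi_bar <- rho * phi_bar + (1 - rho) * phi."""
    return rho * target_params + (1.0 - rho) * params

# Illustrative target critics that disagree about the next state-action value.
q1 = lambda s, a: 10.0
q2 = lambda s, a: 8.0
y = td_target(r=1.0, s_next=None, a_next=None, q_targets=[q1, q2], gamma=0.9)

phi_bar = polyak_update(np.array([1.0, 2.0]), np.array([3.0, 4.0]), rho=0.9)
```

Taking the minimum of the two critics (here 8.0 rather than 10.0) makes the target deliberately pessimistic, which is what suppresses overestimated Q-values for risky actions.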
2) Fear-Constrained Policy Improvement: In FC-PI, the policy improvement aims to maximize the expected return while adhering to the fear constraint.
According to Lagrangian duality theory and (8), the Lagrange dual problem associated with the constrained optimization problem in (7) is

min_{λ≥0} max_π L(π, λ). (17)

To effectively tackle unseen safety-critical scenarios and enhance policy diversity, we optimize the policy of the agent using data from both the virtual and real experience replay memories, and the optimal policy is approximated by maximizing the resulting objective function for the actor network. Additionally, the dual variable λ is updated by minimizing its own objective function, which raises λ whenever the estimated fear exceeds the threshold f_0. In our setting, the cost ĉ returned by the world model represents the probability of a safety violation. Hence, during the model testing phase, to further diminish the risk, the agent can assess the safety of its decisions using the learned world model. For instance, in Fig. 1, if the agent's action is evaluated by the world model as having a high collision risk, then Gaussian noise is added to this action.
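A minimal sketch of the two mechanisms just described: a projected gradient-style update that raises the dual variable while the estimated fear exceeds f_0 (our assumption about how the omitted dual objective would be minimized), and the test-time check that perturbs a risky action with Gaussian noise. Thresholds and step sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def update_dual(lmbda, fear_estimate, f0, lr):
    """Projected gradient-style step on the multiplier: lambda grows while
    the estimated fear exceeds the threshold f0, and stays >= 0."""
    return max(0.0, lmbda + lr * (fear_estimate - f0))

def safe_action(action, collision_risk, risk_threshold=0.5, noise_std=0.3):
    """Test-time check: if the world model judges the action to carry a high
    collision risk, perturb it with Gaussian noise; otherwise keep it."""
    if collision_risk > risk_threshold:
        return action + float(rng.normal(0.0, noise_std))
    return action

lmbda = 0.0
for _ in range(10):  # fear persistently above the threshold drives lambda up
    lmbda = update_dual(lmbda, fear_estimate=0.4, f0=0.1, lr=0.5)
```

As λ grows, the fear term dominates the actor's objective, steering the policy toward satisfying the constraint; once the fear drops below f_0, the update shrinks λ back toward zero.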

IV. RESULTS
To benchmark FNI-RL, we set up experimental comparisons with state-of-the-art AI agents and certified human drivers in complex and critical traffic scenes.

A. Baselines
Rule-Based Driver: An intelligent driver model (IDM) in SUMO is leveraged as a rule-based baseline.
Vanilla RL: We employ proximal policy optimization (PPO) [54] and soft actor-critic (SAC) [55] as two vanilla RL baselines, representing on-policy and off-policy methods.
IL: Generative adversarial imitation learning (GAIL) [56] and RL coach (Roach) [31] are employed as two IL baselines. We utilize the Next Generation Simulation (NGSIM) dataset [57] along with the behavior cloning (BC) technique to train a policy model as the initial model for the two IL baselines. This ensures that the IL agents possess basic driving skills from the start of the training phase. Furthermore, during the training process, the GAIL agent learns expert behaviors by leveraging the demonstration data from IDM.
Human Driver: We recruit 30 human participants for the experiments, all of whom hold valid driving licenses.

B. Metrics
To assess the overall driving quality, we introduce a driving score (DS) defined as

DS = η·SR + (1 − η)·(v/v_max),

where SR is the success rate, and v and v_max denote the agent's speed and the permissible maximum speed, respectively. The weight η is set to 0.8. Successful driving here refers to the vehicle reaching the target lane without any safety violations, including collisions and running a red light. Obviously, DS ∈ [0, 1]. In scenarios (a)-(d) depicted in Fig. 2, the safety violation rate (SVR) reduces to a collision rate (CR). In scenario (e), SVR includes not only CR but also a red-light violation rate (RVR). Furthermore, training-time safety is measured by the total number of safety violations (TNSV) during training.
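Assuming DS is the η-weighted combination of the success rate and the normalized speed (a reading consistent with η = 0.8 and DS ∈ [0, 1]; the exact formula appears in the original equation), the metric can be computed as in this sketch; the function name is ours.

```python
def driving_score(success_rate, speed, v_max, eta=0.8):
    """Weighted blend of task success and speed utilization:
    DS = eta * SR + (1 - eta) * v / v_max, which lies in [0, 1]."""
    assert 0.0 <= success_rate <= 1.0 and 0.0 <= speed <= v_max
    return eta * success_rate + (1.0 - eta) * (speed / v_max)

# An agent that succeeds 90% of the time at half the permitted speed.
ds = driving_score(success_rate=0.9, speed=10.0, v_max=20.0)
```

With η = 0.8, the score is dominated by safety-relevant task success, while the speed term prevents a trivially slow policy from scoring highly.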
In the human-in-the-loop experiment, apart from SR, a time-to-collision (TTC) metric is utilized to evaluate potential collision risks and driving safety. The acceleration of the ego vehicle is utilized as a metric to measure driving smoothness and comfort. Additionally, the acceleration of the following vehicle is leveraged to analyze the influence of the ego vehicle's driving behaviors on surrounding traffic.

C. General Settings
All agents are trained for 2000 episodes in SUMO using five different random seeds. Except for the navigation task, where each episode includes a maximum of 300 time steps, all other tasks have episodes with a maximum of 30 time steps. For a comprehensive evaluation, we set up three traffic flows with different densities, namely flow-0, flow-1, and flow-2, in which the probabilities of emitting a vehicle each second are set to 0.5, 0.3, and 0.7, respectively. All agents are trained in flow-0, while flow-1 and flow-2 are leveraged solely for testing. During the model testing phase, we evaluate the final policy models trained with all the algorithms and different random seeds. All methods utilize the same policy network configuration. For further details, such as the reward function and hyperparameters, please refer to the supplementary material.

D. Traffic Negotiation at Unsignalized Intersections
Task: In scenario (a) depicted in Fig. 2, the ego vehicle (i.e., the red-colored vehicle) executes an unprotected left turn at an unsignalized intersection while interacting with an oncoming dynamic traffic flow. In scenario (b), the ego vehicle carries out a right turn at an unsignalized intersection while interacting with a crossing dynamic traffic flow. In scenario (c), the ego vehicle performs an unprotected left turn at an unsignalized intersection while interacting with an oncoming dynamic traffic flow and two crossing dynamic traffic flows. In scenario (d), the ego vehicle is required to negotiate with an oncoming dynamic traffic flow and two crossing dynamic traffic flows in order to cross an unsignalized intersection.
State and Action: We adopt the information from the 6 nearest vehicles within a 200-meter distance from the ego vehicle, encompassing the relative distance, orientation, speed, and velocity direction of the front, back, left-front, left-back, right-front, and right-back vehicles. Moreover, we incorporate the speed and velocity direction of the ego vehicle, resulting in a state representation with a total of 26 dimensions. The action of the agent is a continuous longitudinal acceleration or deceleration.
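The 26-dimensional state (6 surrounding vehicles × 4 features each, plus the ego vehicle's speed and velocity direction) can be assembled as in the sketch below; the field names and placeholder values are hypothetical.

```python
import numpy as np

def build_state(neighbors, ego_speed, ego_heading):
    """Assemble the 26-dimensional state: 4 features (relative distance,
    orientation, speed, velocity direction) for each of the 6 nearest
    vehicles, plus the ego vehicle's speed and velocity direction:
    6 * 4 + 2 = 26."""
    assert len(neighbors) == 6
    feats = [f for veh in neighbors
             for f in (veh["dist"], veh["orient"], veh["speed"], veh["dir"])]
    return np.array(feats + [ego_speed, ego_heading], dtype=np.float32)

# Hypothetical snapshot: six surrounding vehicles with placeholder features.
neighbors = [dict(dist=30.0 + 5.0 * i, orient=0.1 * i, speed=8.0, dir=0.0)
             for i in range(6)]
state = build_state(neighbors, ego_speed=7.5, ego_heading=0.0)
```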
Additionally, in Table I, we present summary statistics that assess the average performance of each method across all testing conditions. For instance, according to the average DS metric in the last column of Table I, relative to the IDM, PPO, SAC, CPO, SAC-Lag, SMBPO, SMBPPO, GAIL, and Roach agents, FNI-RL gains approximately 2.08%, 40.00%, 13.95%, 11.36%, 8.89%, 10.11%, 7.69%, 34.25%, and 30.67% improvements in DS, respectively. We find that the rule-based IDM agent exhibits strong competitiveness. Specifically, FNI-RL performs comparably to IDM on the easier tasks and surpasses IDM on the more challenging tasks in terms of overall driving performance.

E. Long-Term Goal-Driven Navigation
Task: In scenario (e) of Fig. 2, the ego vehicle first executes an unprotected left turn at an unsignalized intersection while interacting with an oncoming dynamic traffic flow and two crossing dynamic traffic flows. Then, the ego vehicle performs a right turn at an unsignalized intersection while navigating a crossing dynamic traffic flow. Following that, the ego vehicle is required to sequentially traverse an unsignalized intersection and a signalized intersection while interacting with dynamic traffic flows. Afterward, the ego vehicle merges into moving highway traffic from a highway on-ramp and engages in a high-speed cruising task with dynamic traffic flows. Finally, the ego vehicle is tasked with exiting the highway at an off-ramp. Here, successful driving refers to the vehicle arriving at the off-ramp from the starting point without any collisions or running red lights. The total length of the task is 2400 m (700 m + 1700 m) in the east-west direction and 600 m in the north-south direction.
State and Action: In this task, apart from utilizing the 26-dimensional state of scenarios (a)-(d), the agent incorporates three additional states: the distance from the traffic light, the status of the traffic light, and the distance from the navigation target. Consequently, the agent's state encompasses a total of 29 dimensions. Furthermore, the action of the agent includes continuous longitudinal acceleration (or deceleration) as well as the lane-change direction.

Evaluation: Here, we assess and compare the performance of FNI-RL against the nine baseline approaches. Fig. 3 illustrates the training performance of the nine learning-based autonomous driving agents on the long-term goal-driven navigation task under the flow-0 condition. Quantitatively, we provide the average metrics of the last 100 training episodes for each learning-based method under different random seeds, as shown in Table II. Correspondingly, we assess the rule-based IDM baseline using the test results from 500 episodes. Fig. 3 and Table II demonstrate that, overall, FNI-RL surpasses the baselines by a large margin in terms of the DS, SR, CR, and TNSV metrics, while performing comparably to the competitive baseline methods in terms of RVR. Specifically, in comparison with the IDM, SAC, SAC-Lag, and SMBPO agents, the DS metric of FNI-RL is improved by approximately 78.72%, 64.19%, 39.20%, and 14.49%, respectively. Compared with the IDM, SAC,

TABLE II ASSESSMENT RESULTS OF THE RULE-BASED AND LEARNING-BASED AUTONOMOUS DRIVING AGENTS IN THE LONG-TERM GOAL-DRIVEN NAVIGATION BENCHMARK
and SAC-Lag agents, FNI-RL gains approximately 95.65%, 83.16%, and 10.67% improvements in the SR metric, respectively. It is evident that on this challenging long-term goal-driven navigation task, autonomous driving agents trained with the baseline methods struggle to avoid collision incidents effectively compared to FNI-RL. Relative to the PPO, SAC, CPO, SAC-Lag, SMBPO, SMBPPO, GAIL, and Roach agents, the TNSV metric of FNI-RL is reduced by approximately 81.30%, 73.58%, 79.69%, 67.87%, 32.96%, 79.77%, 76.71%, and 79.94% over 2000 training episodes, respectively. We observe that the majority of the autonomous driving agents excel at avoiding running red lights rather than avoiding collisions in the random and dynamic traffic environment. For instance, the rule-based IDM and learning-based Roach methods can ensure complete compliance with red lights; however, they prove less effective in enabling autonomous driving agents to avoid collisions. Additionally, we find that the three on-policy RL baselines (i.e., PPO, CPO, and SMBPPO) fail to make notable progress in terms of DS and SR. Unlike off-policy RL methods, which store experiences in a replay buffer for learning, on-policy RL approaches directly update their policies based on the experiences collected during each episode or trajectory. This distinction may be a disadvantage on the challenging long-term goal-driven navigation task. In addition, since both GAIL and Roach are based on on-policy RL and the IDM-based demonstration data are of insufficient quality, they similarly fail to achieve competitive outcomes on this complicated task.

F. Human-in-the-Loop Experiment
Task: In Fig. 4(a), we construct three cut-in scenarios (scene-0, scene-1, and scene-2) with different levels of aggressiveness (normal, aggressive, and extremely aggressive) to assess the performance of our FNI-RL agent in safety-critical situations compared to 30 certified human drivers. The aggressiveness of the cut-in vehicle is manifested in its hesitation time and its longitudinal distance to the maneuver endpoint. The hesitation time is the period during which the cut-in vehicle maintains its original velocity without initiating any lane change, and the maneuver endpoint is the longitudinal position at which the cut-in vehicle completes its lane change. The ego vehicle is in the leftmost lane. In the formal experiment, each human participant and the FNI-RL agent repeat each scenario five times, and we analyze the average performance over these repeated trials. Since it would be extremely dangerous to perform emergency collision avoidance tasks in a real vehicle, the experiment is conducted on a human-in-the-loop platform built on the high-fidelity CARLA simulator. A detailed description of the experiment can be found in the supplementary material.
State and Action: To demonstrate the advantages of our method, in the cut-in scenes we constructed, the FNI-RL agent only adopts information from the three nearest vehicles within a 200-meter distance of the ego vehicle. This observation consists of 7 dimensions: the ego vehicle's speed, the speeds and relative distances of the nearest front and rear vehicles, and the speed and relative distance of the nearest right-side vehicle. In contrast, the human drivers can observe relevant information, such as the distances and speeds of almost all surrounding vehicles in the traffic environment, through the screens on the platform. Here, the action of our autonomous driving agent is continuous control of longitudinal acceleration or deceleration.
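As a rough sketch, the 7-dimensional observation described above could be assembled as follows (the function name and argument order are illustrative assumptions, not the authors' released code; only the field list and the 200-meter sensing range come from the text):

```python
import numpy as np

SENSING_RANGE_M = 200.0  # vehicles beyond this distance are ignored

def cut_in_observation(ego_speed,
                       front_speed, front_gap,
                       rear_speed, rear_gap,
                       right_speed, right_gap):
    """Pack the ego speed plus the (speed, relative distance) pairs of the
    nearest front, rear, and right-side vehicles into one 7-dim vector."""
    obs = np.array([ego_speed,
                    front_speed, front_gap,
                    rear_speed, rear_gap,
                    right_speed, right_gap], dtype=np.float32)
    assert obs.shape == (7,)
    return obs
```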
Evaluation: The experimental results obtained from the three scenarios are evaluated using four metrics. In Fig. 5(a), the success rate is computed as the ratio of successful runs to total runs, where a successful run is a trial in which the ego vehicle avoids collision with all surrounding social vehicles throughout the run. The human drivers recorded success rates of 81.3%, 76.0%, and 70.0% in the three scenarios, respectively. Surprisingly, our FNI-RL agent consistently outperforms the human drivers in all scenarios, achieving a success rate of 100% in each case. Statistical analysis employing a paired t-test confirms the superior performance of the FNI-RL agent, with p < 1e-4 in all cases. Fig. 5(b) illustrates the average reciprocal time-to-collision (TTC) of the ego vehicle with respect to the cut-in vehicle, where a higher value indicates a higher risk. The FNI-RL agent consistently exhibits greater safety than the human drivers, as evidenced by lower reciprocal TTC values across all scenarios; the statistical significance of this superiority is validated with p < 1e-4 in all cases. In Fig. 5(c), the FNI-RL agent exhibits smoother driving across all scenarios, as supported by its lower average acceleration values in comparison to the human drivers. Statistical tests confirm the significance of this difference, with p < 1e-2 for scene-0 and p < 1e-4 for scene-1 and scene-2. In Fig. 5(d), compared to the human drivers, the FNI-RL agent exerts a smaller and more stable effect on the rear vehicle, consequently enhancing overall traffic performance. This improvement is substantiated through t-tests, as depicted in Fig. 5(d).
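The success-rate and reciprocal-TTC metrics above can be sketched in a few lines (a hedged illustration; the function names are assumptions, while the 0.1 s TTC floor for unsuccessful trials follows the convention stated in the caption of Fig. 5(b)):

```python
import numpy as np

def success_rate(outcomes):
    """Fraction of runs with no collision (outcomes: iterable of booleans)."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes)

def mean_reciprocal_ttc(ttc_values, failed_mask, floor_ttc=0.1):
    """Average 1/TTC over trials; unsuccessful trials are assigned a small
    constant TTC (0.1 s). Higher values indicate higher risk."""
    ttc = np.where(failed_mask, floor_ttc, ttc_values)
    return float(np.mean(1.0 / ttc))
```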
Overall, FNI-RL performs comparably to the baselines with changed hyperparameters and outperforms the baselines with critical components removed, in terms of the final DS and SR metrics. From Table III, we can see that the component implementing the fear model has a significant impact on the performance of FNI-RL, especially on safety. In addition, by comparing the "α = 0.5", "β = 0.8", "f0 = 0.1", and "m = 10" baselines, we find that the hyperparameters have a certain impact on the performance of FNI-RL, but in general FNI-RL is not very sensitive to hyperparameter changes. Consequently, the results of the ablation analysis demonstrate that the components and settings in FNI-RL are critical. More results can be found in the supplementary material.

V. DISCUSSION AND CONCLUSION
Performance: Inspired by the amygdala, which arouses fear and defensive behaviors in organisms upon the recognition of dangers or contingencies, we propose the FNI-RL framework to realize safe autonomous driving.
The results demonstrate the effectiveness of FNI-RL via simulations and experiments. In scenarios (a)-(e), FNI-RL achieves performance superior to that of the competitive AI agents, especially in terms of safety. In the human-in-the-loop
experiment, one obstacle to evaluating our agent is the "transfer gap": the performance of an agent well trained in the SUMO-based simulation can easily degrade in the experiment. One major reason for this problem may be the differences between the vehicle models in the two environments. Surprisingly, the experimental results indicate that FNI-RL can match the performance of the 30 certified human drivers in three safety-critical scenarios. Additionally, the ablation studies show that the components in FNI-RL that simulate the amygdala mechanism are critical.
Diving Deeper Into the Results: We find four possible explanations for the above results. (1) Threats and contingencies can be recognized or estimated with the fear model; FNI-RL selects the action that minimizes fear during interactions with the real environment. (2) While prediction error is unavoidable, by combining the adversarial agent with the world model, the adversarial imagination technique is able to simulate worst-case situations in imagination, enabling the agent to tackle unseen critical situations and improving its policy robustness against the "transfer gap" and uncertainties. (3) The FC-AC algorithm enables the agent to learn defensive driving behaviors that ensure safety or performance during emergencies. (4) Compared with human drivers, autonomous driving systems have faster reaction times and are immune to fatigue.
Broader Impact: RL has been an impressive component of modern AI and is still under vigorous development. Nonetheless, unlike supervised learning, which has found extensive application in various commercial and industrial domains, RL has not gained widespread acceptance and deployment in real-world tasks. One important aspect is trustworthiness, in which safety plays a critical role. Compared to AI, especially RL, human intelligence is considered safer and more trustworthy. Our framework, inspired by the brain's fear circuit, contributes to the foundation for realizing safe AI, potentially bringing RL closer to safety-critical real-world applications. Moreover, this work establishes linkages between AI, neuroscience, and psychology, which may be beneficial for interpreting the RL process in the brain.
Limitations and Future Work: Our algorithm implementation has several simplifications (e.g., its network structure and limited states) for the convenience of simulation and experimentation. We believe that neural networks considering temporal sequences, e.g., transformers [1], could improve the performance of FNI-RL, and this topic will be studied in the future. Additionally, the amygdala enables organisms to learn at fast rates and track rapid changes in environments, while the striatum is more robust to noise [14]. However, since the internal structure and mechanisms of the amygdala and striatum remain unclear, FNI-RL has not lived up to its full potential. Further investigation is required to elucidate the fundamental principles of the amygdala and striatum, fostering the development of RL-based computational models and high-level autonomous driving.

Fig. 1. Schematic of the proposed FNI-RL framework for safe autonomous driving. (a) RL-related functional systems in the brain. (b) Adversarial imagination module for simulating the amygdala mechanism. (c) Fear-constrained actor-critic technique. (d) Agent-environment interaction loop.

Fig. 2. Experimental traffic environments. (a) Unprotected left turn at an unsignalized intersection with oncoming traffic. (b) Right turn at an unsignalized intersection with crossing traffic. (c) Unprotected left turn at an unsignalized intersection with mixed traffic flows. (d) Crossing negotiation at an unsignalized intersection with mixed traffic flows. (e) Long-term goal-driven navigation with mixed traffic flows.

Fig. 3. Training performance of the different autonomous driving agents on the long-term goal-driven navigation task based on the stochastic dynamic traffic flows. (a) Success rate. (b) Collision rate. (c) Red-light violation rate.

Fig. 4. Human-in-the-loop experiment. (a) Cut-in scenarios with three levels of aggressiveness. The ego vehicle (i.e., the golden-colored vehicle in the leftmost lane) performs a high-speed cruising task while a nearby vehicle suddenly cuts into its lane. The ego vehicle should stay in its lane and avoid collisions to the greatest extent possible. (b) Experimental platform. The human drivers manipulate the steering wheel and pedals to control the ego vehicle. A computing platform and three heads-up displays provide a real-time, high-fidelity in-vehicle view.

Fig. 5. Statistical results produced by the human drivers (blue bars) and the FNI-RL agents (orange bars). (a) Bar plot of the success rates of the human drivers and the FNI-RL agent. (b) Boxplot of the reciprocal of the time-to-collision values produced by the human drivers and the FNI-RL agent, where the time-to-collision is calculated based on the moment at which the cut-in vehicle reaches the ego lane, and a small but nonzero constant (0.1 s) is used as the time-to-collision value for the unsuccessful trials. (c) Boxplot of the mean absolute value of the acceleration of the ego vehicle, where the counting range is 2 s from the time at which the cut-in behavior occurs. (d) Boxplot of the mean absolute value of the acceleration of the rear vehicle, where the counting range is 2 s from the time at which the cut-in behavior occurs.
Xiangkun He (Member, IEEE) received the PhD degree from the School of Vehicle and Mobility, Tsinghua University, Beijing, China, in 2019. From 2019 to 2021, he served as a senior researcher with Huawei Noah's Ark Lab. He is currently a research fellow with Nanyang Technological University, Singapore. His research interests include autonomous driving, reinforcement learning, trustworthy AI, and decision and control. He has received a number of awards and honors, including the Tsinghua University Outstanding Doctoral Thesis Award in 2019, Best Paper Finalist at the 2020 IEEE ICMA, 1st Class Outstanding Paper of the China Journal of Highway and Transport in 2021, the Huawei Major Technological Breakthrough Award in 2021, the Best Paper Runner-Up Award at the 2022 6th CAA International Conference on Vehicular Control and Intelligence, and Runner-Up at the Intelligent Algorithm Final of the 2022 Alibaba Global Future Vehicle Challenge.

Wu Jingda (Graduate Student Member, IEEE) received the BS and MS degrees in mechanical engineering from the Beijing Institute of Technology, China, in 2016 and 2019, respectively. He is currently working toward the PhD degree with the School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore. His research interests include human guidance-based reinforcement learning algorithms, human-artificial intelligence (AI) collaborative driving strategy design, and decision making for autonomous vehicles.

Zhiyu Huang (Graduate Student Member, IEEE) received the BE degree from the School of Automobile Engineering, Chongqing University, Chongqing, China, in 2019. He is currently working toward the PhD degree with the School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore. His current research focuses on machine learning-based methods for decision making in autonomous driving, including reinforcement learning, behavior prediction, and data-driven motion planning.

TABLE I
STATISTICAL RESULTS OF DIFFERENT AUTONOMOUS DRIVING AGENTS IN THE TRAFFIC SCENARIOS (A)-(D), INCLUDING THE MEAN AND STANDARD DEVIATION (IN BRACKETS)