Destabilizing Attack and Robust Defense for Inverter-Based Microgrids by Adversarial Deep Reinforcement Learning


Abstract-The droop controllers of inverter-based resources (IBRs) can be adjusted by grid operators to facilitate regulation services. Considering the increasing integration of IBRs in power distribution-level systems like microgrids, cyber-security is becoming a major concern. This paper investigates a data-driven destabilizing attack and a robust defense strategy based on adversarial deep reinforcement learning for inverter-based microgrids. Firstly, the full-order high-fidelity model and the reduced-order small-signal model of typical inverter-based microgrids are recapitulated. Then the destabilizing attack on the droop control gains is analyzed, which reveals its impact on system small-signal stability. Finally, the attack and defense problems are formulated as a Markov decision process (MDP) and an adversarial MDP (AMDP). The problems are solved by the twin delayed deep deterministic policy gradient (TD3) algorithm to find the least-effort attack path of the system and obtain the corresponding robust defense strategy. Simulation studies are conducted in an inverter-based microgrid with 4 IBRs and the IEEE 123-bus system with 10 IBRs to evaluate the proposed method.

I. INTRODUCTION
THE power system is facing the uphill challenge of high-level penetration of renewable generation in order to meet the net-zero carbon target in the energy sector [1], [2]. In distribution-level microgrid systems, a large number of inverter-based resources (IBRs) are being connected to the power network in a distributed way. Different from bulk power systems dominated by synchronous generators, the dynamics of these inverter-based systems are determined by the control modes of the power electronic interfaces. Besides, power electronic devices present a much faster response than synchronous generators, which means the time scale of the network dynamics is comparable and cannot be ignored in stability analysis. To facilitate regulation services in a time-varying system environment, certain control parameters of IBRs become adjustable or dispatchable. Like a double-edged sword, the flexibility brought by the user-defined control systems of power converters also increases the attack surface. Therefore, the vulnerability and cyber-security of inverter-integrated power systems are emerging yet important problems to be investigated.
The cyber-security of bulk power systems with multiple machines has long raised concerns. The cyber-security of different processes in power systems has been studied, such as state estimation [3], power dispatch [4], and automatic generation control [5]. In addition, a few works consider cyber-attacks for destabilizing dynamic power systems. In an early work, the author of [6] introduced the destabilizing attack on power systems through the state-feedback controller. The synchronous generators are divided into a control group manipulated by malicious attackers and a target group to be destabilized. The attack aims to shift certain sensitive eigenvalues from the left half-plane into the right half-plane. This method was later applied to mixed-source microgrids [7]. In recent works, dynamic load-altering attacks have been studied, as the wide adoption of demand-response schemes increases the attack surface. In [8], the attack on dynamic loads aims to destabilize the power system, where the victim loads are changed based on the feedback of the system frequency. A non-convex optimization problem is formulated to determine the minimum amount of load to be protected at each bus. In [9], the latency attack on the automatic generation control of the power system and its impact on system stability are studied. A parameter tuning method based on an exhaustive and heuristic search is proposed to maximize the stability region under such an attack.
In the meantime, the cyber-security problem has also raised much attention in power-electronics-enriched systems like microgrids. The wide integration of IBRs increases system flexibility while decreasing system security. A large amount of work has been conducted on the secondary control systems of microgrids, as their attack surface is enlarged with the utilization of communication systems [10]. The impact of typical attacks such as false data injection (FDI) and denial-of-service (DoS) has been investigated [11], [12]. Methodologies have been provided for cyber-attack prevention, detection, isolation, and mitigation for network-controlled microgrids. The resilient control and detection indexes are designed considering the specific consensus algorithms in the secondary control of microgrids. It is noted that FDI and DoS attacks mainly influence the system operational points, aiming to make the system violate the security boundary [13].

1949-3053 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
It can be found that prior works on destabilizing attacks focused on synchronous-generator-dominated power systems [6], [8], [9]. There is minimal work on cyber-attacks targeting the small-signal stability of inverter-based systems like microgrids and the corresponding defense mechanisms. This motivates us to study this problem further. Specifically, the least-effort attack with minimal droop-parameter change to destabilize inverter-based microgrids is studied. It helps to understand the system's vulnerable parameters and its manifestation under destabilizing attacks. In addition, it also contributes to developing corresponding defensive mechanisms to mitigate the impact of such attacks.
These kinds of attack and defense problems can be formulated as dynamic programming or optimal control problems. However, these problems usually involve the non-linear dynamics of system models and non-convexity in solving for system eigenvalues. This motivates us to apply a data-driven method based on deep reinforcement learning (DRL) to find online approximate solutions to such problems. DRL approaches have been widely used for power engineering problems, such as voltage control [14], frequency control [15], and energy management [16]. A literature review of DRL applications in power systems is provided in [17]. By interacting with the dynamic environment, DRL algorithms can train deep neural network (DNN) based agents to find an optimal control policy. Based on the policy, DRL methods can be divided into deterministic-policy methods, e.g., deep deterministic policy gradient (DDPG), and stochastic-policy methods, e.g., proximal policy optimization (PPO) and soft actor-critic (SAC). DRL methods have been applied to address cyber-security problems of microgrids and power systems in some recent works [18], [19], [20]. In [18], a multi-agent deep Q-network approach is proposed to detect the vulnerable spots in the index-based detection schemes for the secondary control of islanded DC microgrids. In [19], a DRL-based method is proposed to provide an optimal defense strategy for microgrids subject to FDI on the load demand. In [20], an asynchronous advantage actor-critic (A3C) based multi-agent DRL is proposed to provide resilient control for the secondary control of microgrids to alleviate the impact of DoS attacks. In addition, adversarial reinforcement learning has been proposed to find robust control solutions for volt-var control problems in power distribution networks with uncertainty in the environment [21]. The adversarial training of DRL agents has been proposed for robust continuous control with attackers in cyber-physical power systems [22]. This approach demonstrates its potential for addressing the destabilizing attack and robust defense problem in inverter-based systems.
In this paper, the cyber-attack and defense strategy in inverter-based microgrids is studied systematically. Specifically, the impacts of destabilizing attacks on the droop control gains on system stability are analyzed. The attack shifts the system modes and shrinks the small-signal stability region by manipulating droop gains. The analysis reveals that such attacks can be defended against by changing sensitive droop gains of the system. Then the least-effort attack (LEA) and its defense problems are introduced correspondingly. The attack and defense problems are formulated as a Markov decision process (MDP) and an adversarial MDP (AMDP). The twin delayed deep deterministic policy gradient (TD3), a deterministic-policy DRL method, is proposed to identify the dynamic LEA for inverter-based systems. Compared to a stochastic policy, the agent with a deterministic policy trained by TD3 can provide a deterministic action to adjust the droop gains in the dynamic system. Besides, an adversarial reinforcement learning framework is adopted to find the dynamic and robust defense strategy under LEA. The distinct contributions of this paper compared to existing works are:
• Considering the small-signal stability of inverter-based systems, the destabilizing attack is modelled and analyzed for the first time.
• The attack and defense problems are formulated as finding the optimal combination of droop gains within the attack and defense sets in inverter-based systems.
• The TD3 algorithm is adopted for training the attack agents, while the robust defense strategy is generated by adversarial training between the attack and defense agents.

II. SYSTEM MODELLING
To investigate the destabilizing attack on system stability, the dynamic model of multi-inverter microgrid systems is presented. Based on the full-order high-fidelity model, the reduced-order small-signal model can be derived [23], [24], [25]. They are used to calculate the system trajectory as well as the trace of eigenvalues under cyber-attack. The system model consists of the dynamics of inverters, network and loads, and the transformation between local and common frames. The network dynamics are taken into account because IBRs respond much faster than synchronous generators.

A. Modelling of Inverter-Based Microgrids
A typical inverter-based microgrid with multiple IBRs governed by grid-forming and droop control is considered in this paper, as demonstrated in Fig. 1. The power network of the system can be represented by a complex-weighted graph G = (V, E), where the nodes V represent the buses and the edges E represent the line connections. The loads and inverters are sparsely connected at the buses.
For generality, it is considered that the DC-side voltage is well maintained at the primary side. Owing to the high switching frequency, the inverter dynamics can be modelled with a continuous average model. The modelling is conducted in the dq frame, which can be converted to the abc frame by the Park transformation. The local dq frame can be transferred into a common reference DQ frame as follows [23]:

f_DQ = T(δ_i) f_dq,  T(δ_i) = [cos δ_i  −sin δ_i; sin δ_i  cos δ_i]

The angle of the ith inverter is calculated by:

δ̇_i = ω_i − ω_com

where ω_com is the common reference frequency.
1) Inverter Modelling: The droop control for the power inverters of IBRs is designed with the philosophy of emulating the behavior of synchronous generators to share the load demand based on frequency deviation. Similarly, the reactive power can be shared by droop control with the voltage magnitude. Considering a first-order filter in the power calculation process, they can be represented as [25]:

ω_i = ω_n − m_i P_i,  V_i = V_n − n_i Q_i
τ Ṗ_i = p̃_i − P_i,  τ Q̇_i = q̃_i − Q_i

where ω_i, V_i are the frequency and voltage references for the inner control loops, ω_n, V_n are the nominal values of frequency and voltage, P_i, Q_i are the measured (low-pass filtered) real and reactive power with p̃_i, q̃_i the instantaneous powers, and m_i, n_i are the corresponding droop gains. τ = 1/ω_c is the low-pass filter time constant for the power measurement, where ω_c is the cut-off frequency. It is noted that the output voltage magnitude is aligned to the local d-axis of the inverter reference frame, while the q-axis reference is zero (v*_qi = 0). The droop gains are typically selected based on the allowable frequency and voltage ranges, as follows [23]:

m_i = (ω̄_i − ω̲_i) / P̄_i,  n_i = (V̄_i − V̲_i) / Q̄_i

where ω̄_i, ω̲_i are the maximum and minimum allowable frequency, V̄_i, V̲_i are the maximum and minimum allowable voltage magnitude, and P̄_i, Q̄_i are the maximum real and reactive power of the ith IBR. The voltage and current control loops are as follows:

φ̇_di = v*_di − v_di,  i*_ldi = K_PVi (v*_di − v_di) + K_IVi φ_di
φ̇_qi = v*_qi − v_qi,  i*_lqi = K_PVi (v*_qi − v_qi) + K_IVi φ_qi
γ̇_di = i*_ldi − i_ldi,  v*_idi = K_PCi (i*_ldi − i_ldi) + K_ICi γ_di
γ̇_qi = i*_lqi − i_lqi,  v*_iqi = K_PCi (i*_lqi − i_lqi) + K_ICi γ_qi

where φ_di, φ_qi, γ_di, and γ_qi are the state variables of the voltage and current control loops, K_PVi, K_IVi, K_PCi, and K_ICi are the proportional and integral gains of the voltage and current control loops, i*_ldi, i*_lqi are the references generated by the voltage control, which are tracked by the current control, and i_ldi, v_iqi are the current and voltage measurements before the LC filter.
The differential equations for the output LC filter are as follows:

L_fi (di_ldi/dt) = −R_fi i_ldi + ω_i L_fi i_lqi + v_idi − v_di
L_fi (di_lqi/dt) = −R_fi i_lqi − ω_i L_fi i_ldi + v_iqi − v_qi

where R_fi, L_fi are the filter resistance and inductance of the ith inverter.
2) Network and Loads: For a multi-inverter system, the interconnected variables of each inverter with the network and loads should be transferred between the local dq frame and the common DQ frame. Specifically, the output voltage of the inverter is transferred to the DQ frame by V_DQ = T(δ)v_dq. For the distribution line between bus i and bus k, the dynamics of the line current in the DQ frame are represented as:

dI_ikD/dt = (−R_ik I_ikD + V_iD − V_kD)/L_ik + ω_0 I_ikQ
dI_ikQ/dt = (−R_ik I_ikQ + V_iQ − V_kQ)/L_ik − ω_0 I_ikD

where R_ik, L_ik are the resistance and inductance between bus i and bus k, and ω_0 is a constant synchronous frequency. An equivalent resistance-inductance (RL) load is considered at each bus in the system. The dynamics of the RL load connected at bus i in the DQ frame can be expressed as:

dI_LiD/dt = (−R_Li I_LiD + V_iD)/L_Li + ω_0 I_LiQ
dI_LiQ/dt = (−R_Li I_LiQ + V_iQ)/L_Li − ω_0 I_LiD

where R_Li, L_Li indicate the equivalent resistance and inductance of the RL load at bus i. As the current is balanced at each bus, the current injected by each inverter equals the load current plus the net line currents at its bus. The current of the inverter can then be transferred back to the local dq frame by i_dq = T(δ)⁻¹ I_DQ. Based on the power calculation in the local dq frame, it can be obtained that:

p̃_i = v_di i_di + v_qi i_qi,  q̃_i = v_qi i_di − v_di i_qi

Then the calculated powers P_i and Q_i, obtained by low-pass filtering p̃_i and q̃_i, are applied in the droop control in (3) and (4).
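The frame transformation, filtered power calculation, and droop laws above can be sketched in a few lines of Python; the function names and the explicit Euler discretization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def T(delta):
    """Rotation from the local dq frame to the common DQ frame."""
    return np.array([[np.cos(delta), -np.sin(delta)],
                     [np.sin(delta),  np.cos(delta)]])

def filtered_power_step(P, Q, v_dq, i_dq, tau, dt):
    """One Euler step of the first-order power filter tau*dP/dt = p - P."""
    p_inst = v_dq[0] * i_dq[0] + v_dq[1] * i_dq[1]   # instantaneous real power
    q_inst = v_dq[1] * i_dq[0] - v_dq[0] * i_dq[1]   # instantaneous reactive power
    P += dt / tau * (p_inst - P)
    Q += dt / tau * (q_inst - Q)
    return P, Q

def droop(P, Q, omega_n, V_n, m, n):
    """Droop laws: omega = omega_n - m*P, V = V_n - n*Q."""
    return omega_n - m * P, V_n - n * Q
```

Note that `T(delta)` is orthogonal, so the inverse transformation back to the local frame is simply its transpose.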

B. Small-Signal Model
The small-signal model is widely used to analyze the stability of inverter-based microgrids. A reduced 5th-order model per inverter, obtained by simplifying the inner control loops and LC filter dynamics in (7)-(18), is considered. The reduced-order system still offers high accuracy for calculating the eigenvalues and evaluating the system stability [25]. By linearizing the above system around the operational or equilibrium point using a Taylor expansion, the small-signal model can be obtained. The equilibrium point can be obtained by solving the differential equations of the full-order high-fidelity model. Therefore, by integrating the state equations of the inverters, network, and loads, the small-signal model of the multi-inverter system can be obtained as:

ẋ_sys = A_sys x_sys (25)

where x_sys comprises the inverter states together with the vectors of line currents and load currents. An incidence matrix ∇ᵀ of the power network is introduced, where ∇ᵀ_ij = 1 if the current of the jth line is injected into the ith bus, and ∇ᵀ_ij = −1 if the current of the jth line leaves the ith bus. xˢ ∈ {δˢ, ωˢ, Vˢ, Iˢ_D, Iˢ_Q} denotes the equilibrium points of the system, and A_sys is the detailed coefficient matrix of the system, whose entries depend on the equilibrium points Vˢ_D, Vˢ_Q, Iˢ_D, Iˢ_Q, δˢ in diagonal matrix form, which can be obtained from the time-domain simulation of the non-linear model presented in Section II-A. M = diag{m_1, m_2, ..., m_N} and N = diag{n_1, n_2, ..., n_N} are diagonal matrices of the droop control gains. R and X are the resistance and inductance matrices of the network and loads. It is noted that this model is scalable with the number of inverters, buses, and loads in the system. The derivation of this model is omitted for brevity. The system contains 3N_Inv + 2N_Load + 2N_Line states, where N_Inv, N_Load, N_Line are the numbers of inverters, loads, and distribution lines.
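As a minimal illustration of how the small-signal model is used, the stability of ẋ = A x can be checked numerically from the eigenvalues of A; the 2×2 matrices below are toy examples, not the paper's A_sys.

```python
import numpy as np

def is_small_signal_stable(A):
    """x' = A x is asymptotically stable iff every eigenvalue of A
    has a strictly negative real part."""
    return bool(np.max(np.linalg.eigvals(A).real) < 0.0)

# Toy 2x2 state matrices (hypothetical, not the paper's A_sys):
# a damped oscillator vs. a negatively damped one
A_stable = np.array([[0.0, 1.0], [-4.0, -0.5]])
A_unstable = np.array([[0.0, 1.0], [-4.0, 0.2]])
```

The same check applies unchanged to the full A_sys of (25), whose dimension is 3N_Inv + 2N_Load + 2N_Line.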

III. ANALYSIS OF DESTABILIZING ATTACK AND DEFENSE ON INVERTER-BASED MICROGRIDS

In the studied multi-inverter systems, the droop control gains of each IBR are adjustable. They can be changed to adapt to grid conditions, or dispatched by the system operator via communication systems [7]. In the meantime, the parameters of the inner control loops are specifically designed for each inverter by the manufacturer and are usually not changeable. The attack surface of multi-inverter systems is considered to be these flexible parameters, such as the droop control gains and the power set-points. The droop control gains influence the stability of the system, while the power set-points influence the equilibrium points. In this study, we mainly focus on destabilizing attacks that adjust the droop control gains and their influence on the small-signal stability of the system.

A. Attack and Defense on Droop Gains
First, all the droop gains of IBRs in the inverter-based microgrid are separated into two sets. The attack set V_att contains droop gains that can be manipulated by attackers to destabilize the system. The defense set V_def contains droop gains that can be controlled by defenders to stabilize the system. Based on the attack and defense sets, the IBRs in the system can be separated into victim IBRs and defense IBRs. Therefore, the system under attack and defense can be formulated as a dynamic system driven by the attack input u_att and the defense input u_def, where u_att is the cyber-attack strategy targeting the stability of the system and u_def is the defense strategy to maintain the stability of the system. u_att = 0 or u_def = 0 means there is no cyber-attack or no defense control. The small-signal model of inverter-based microgrids with attack and defense on the droop gains becomes

ẋ_sys = (A_sys + A_att + A_def) x_sys (29)

where A_att is a matrix denoting the droop control gain change in the victim IBRs. Specifically, the original droop gains m_i, n_i within the attack set V_att are shifted by m_att,i, n_att,i. That is to say, the entries of M and N within the attack set are changed compared to the original system matrix A_sys, which finally influences the small-signal stability of the multi-inverter system. To defend against such an attack, the system operators can design certain strategies to change the droop gains of the defense IBRs. A_def is a matrix denoting the droop control gain change in the defense IBRs. Specifically, the original droop gains m_i, n_i within the defense set V_def are changed by m_def,i, n_def,i.
Recall the definition of an eigenvalue and its eigenvectors:

A φ_i = λ_i φ_i,  ψ_i A = λ_i ψ_i

where φ_i and ψ_i are the right and left eigenvectors of λ_i. The eigenvalues of A can be obtained by solving the determinant det(A − λ_i I) = 0. A negative real part of λ_i indicates a stable mode, a zero real part a marginally stable mode, and a positive real part an unstable mode.
The spectral abscissa of the system matrix A is the maximum real part of its eigenvalues, which can be presented as [26]:

η(A) = max_i Re{λ_i(A)}

Here we further define η̃(A) as the spectral abscissa restricted to eigenvalues with nonzero imaginary part, Im{λ_i} ≠ 0, considering the fact that the eigenvalues will not shift to the right half-plane when they are on the real axis.
The damping ratio ζ_i is defined as

ζ_i = −α_i / √(α_i² + β_i²)

where α_i and β_i are the real and imaginary parts of λ_i. It describes the attenuation of the system oscillations. In addition, the sensitivity of eigenvalue λ_i with respect to a parameter κ can be calculated by

∂λ_i/∂κ = (ψ_i (∂A/∂κ) φ_i) / (ψ_i φ_i)

It is noted that the calculation of eigenvalues involves non-convexity, as do the associated quantities including the spectral abscissa, damping ratio, and eigenvalue sensitivity with respect to system parameters, which brings difficulty to related optimization problems [8], [27].
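The spectral abscissa over oscillatory modes, the damping ratio, and the eigenvalue sensitivity can be computed numerically as sketched below; the function names are assumptions, and the sensitivity uses the identity that the rows of Φ⁻¹ are left eigenvectors normalized so that ψ_i φ_i = 1.

```python
import numpy as np

def spectral_abscissa_osc(A, imag_tol=1e-9):
    """Max real part over eigenvalues with nonzero imaginary part
    (the restricted spectral abscissa for oscillatory modes)."""
    eig = np.linalg.eigvals(A)
    osc = eig[np.abs(eig.imag) > imag_tol]
    return float(np.max(osc.real)) if osc.size else -np.inf

def damping_ratio(lam):
    """zeta_i = -alpha_i / sqrt(alpha_i^2 + beta_i^2)."""
    return -lam.real / np.hypot(lam.real, lam.imag)

def eigen_sensitivity(A, dA_dk, i):
    """dlambda_i/dkappa = psi_i (dA/dkappa) phi_i, where the rows of
    Phi^-1 are left eigenvectors with psi_i phi_i = 1."""
    _, Phi = np.linalg.eig(A)
    Psi = np.linalg.inv(Phi)
    return Psi[i] @ dA_dk @ Phi[:, i]
```

For a diagonal A the sensitivity reduces to the corresponding diagonal entry of ∂A/∂κ, which gives a quick sanity check.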

B. Analysis With a Microgrid Example
The destabilizing attack on droop gains aims to shift the eigenvalues of the original system A_sys into unstable ones. A microgrid system with 4 IBRs is used as an example to show how droop gain changes influence the system's stability. The detailed parameters of the microgrid with 4 IBRs are shown in Table I. The stability region of the small-signal model with respect to the frequency and voltage droop gains is shown in Fig. 2. The eigenloci under the change of m_i in the small-signal model are shown in Fig. 3. As shown in Fig. 2(a), there are two dimensions of the stability region to be changed, i.e., the frequency and voltage droop gains m_i, n_i. As shown in Fig. 2(b), changing m_3 of IBR-3 will shift the stability region of IBR-4. It indicates that the defender can change the droop gains of certain IBRs in order to stabilize the system under an attack on other IBRs. As shown in Fig. 3, by changing m_i from 5×10⁻⁵ to 1×10⁻³, the modes λ_15 and λ_16 are moved towards the right half-plane. With a sufficient amount of manipulation of the droop gains, the system will become unstable. The defender has the opposite goal, which is to place all eigenvalues in the left half-plane. Besides, it can be found from Fig. 2(a) that IBR-3 has the smallest stability region. It indicates that IBR-3 is the most vulnerable part under destabilizing attack, which should be well protected. In addition, in order to defend against an attack on certain droop gains, the defender should have more resources than the attacker, so that the eigenvalues can be shifted to the desired regions.
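A Fig. 2 style stability region can be reproduced by scanning a grid of droop gains and testing the eigenvalues at each point. The sketch below uses a hypothetical two-state surrogate `build_A` whose damping shrinks as the gains grow; it is illustrative only, not the 4-IBR model.

```python
import numpy as np

def stability_region(build_A, m_vals, n_vals):
    """Mark each (m, n) grid point where the system matrix is
    small-signal stable (all eigenvalue real parts < 0)."""
    region = np.zeros((len(m_vals), len(n_vals)), dtype=bool)
    for a, m in enumerate(m_vals):
        for b, n in enumerate(n_vals):
            region[a, b] = np.max(np.linalg.eigvals(build_A(m, n)).real) < 0.0
    return region

def build_A(m, n):
    """Hypothetical 2-state surrogate whose damping term 1 - 50m - 20n
    shrinks with the droop gains (illustrative assumption)."""
    return np.array([[0.0, 1.0], [-10.0, -(1.0 - 50.0 * m - 20.0 * n)]])
```

Replacing `build_A` with a routine that assembles A_sys + A_att for given gains yields the actual region; the boundary is where the spectral abscissa crosses zero.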
Considering the attack and defense sets of IBRs, there are three general cases: (a) The attack and defense sets have no intersection, i.e., V_att ∩ V_def = ∅. There are two sub-conditions: i) The IBRs under attack and defense are different. In this condition, the defender can change m_i of other IBRs to make the destabilizing attack unsuccessful, as shown in Fig. 2(b). ii) The IBRs under attack and defense are the same, but the attacked and defended gains differ. If the attacker manipulates m_i of a certain IBR, the defender can adjust n_i to keep the system operating point in the stability region. Similarly, if the attacker manipulates n_i of a certain IBR, the defender can reduce m_i below a certain value.
(b) The attack and defense sets are equal, i.e., V_att = V_def. This happens when certain inverters are subject to attack, but the defender does not lose the ability to control them. If the defender can adapt to the changes made by the attacker, the attack can be defended.
(c) The attack and defense sets have a partial intersection, i.e., V_att ∩ V_def ≠ ∅. This is a more general condition compared to cases (a) and (b), and a mixed strategy can be taken by the defenders. Therefore, a proper method should be developed to find the combination of droop gains in the defense set.

C. Least Effort Attack
Among the victim IBRs, one can find the least effort attack with minimal changes of the droop control gains. Thus, the least effort attack is defined as the attack with minimal changes of m_att,i and n_att,i within the attack set V_att:

min Σ_{i∈V_att} (|m_att,i| + |n_att,i|) (35)
s.t. det(A_sys + A_att − λ_i I) = 0 (36)
η(A_sys + A_att) > 0 (37)
m̲_i ≤ m_i + m_att,i ≤ m̄_i (38)
n̲_i ≤ n_i + n_att,i ≤ n̄_i (39)

This problem is to find the values of m_att,i and n_att,i that minimize the objective (35). Equation (36) is the determinant for the calculation of the system eigenvalues. Constraint (37) requires the system spectral abscissa to be larger than zero, which makes the system unstable. Constraint (37) can be replaced by η(A_sys + A_att) = η*_att if the attack aims to place the spectral abscissa at a specific value. Besides, λ_i or ζ_i can also be placed in the constraints if the attack targets specific eigenvalues and modes. Inequalities (38)-(39) impose upper and lower bounds on the total droop gains of each IBR. They are preset limits of the IBR which cannot be violated.
The above formulation is for the static LEA, which does not consider system changes. Considering this problem in a dynamic environment, it is equivalent to finding the optimal attack strategy u*_att,t for the dynamic system in (26). As the small-signal model can describe the system stability at each time interval, the sequential or dynamic LEA can be formulated by introducing the time interval t into the above LEA problem. Both LEA problems involve the calculation of the eigenvalues and spectral abscissa of the system under attack. As the eigenvalue sensitivity (34) of the studied system is highly non-linear, it is hard to estimate the final value from the original condition. Besides, in real operating conditions, there will be parameter variations in the inverter-based microgrids.

D. Defense Strategy
To defend against the dynamic LEA on droop gains in inverter-based microgrids, a defense strategy can be designed to change the droop gains by m_def,i and n_def,i within the defense set V_def. Therefore, the defense problem can be represented as

min Σ_{i∈V_def} (|m_def,i,t| + |n_def,i,t|) (40)
s.t. det(A_sys,t + A_att,t + A_def,t − λ_i I) = 0 (41)
η(A_sys,t + A_att,t + A_def,t) < 0 (42)
m̲_i ≤ m_i + m_att,i,t + m_def,i,t ≤ m̄_i (43)
n̲_i ≤ n_i + n_att,i,t + n_def,i,t ≤ n̄_i (44)

This defense problem is similar to the above dynamic LEA problem. It aims to stabilize the system by adjusting the droop gains. Constraint (42) can be replaced by η(A_sys,t + A_att,t + A_def,t) = η*_def if the defender aims to place the spectral abscissa at a specific value.
Again, this involves the calculation of eigenvalues and the non-linear system dynamics, which makes the problem hard to deal with. In the next section, we propose to use deep reinforcement learning to obtain online optimal solutions to these attack and defense problems.

IV. ATTACK AND DEFENSE BY ADVERSARIAL DEEP REINFORCEMENT LEARNING
In this section, the dynamic LEA and its defense are presented in detail. Firstly, the dynamic LEA is formulated as an MDP. Then the robust defense problem under such an attack is formulated as an AMDP. The training and implementation framework is demonstrated in Fig. 4. As shown in Fig. 4, the reduced-order small-signal model calculates the eigenvalues of the system at each time interval during the offline training stage. The system equilibrium points are obtained from the high-fidelity model. In the online implementation stage, both the small-signal model and the high-fidelity model can be applied, which simulate the system trajectory and calculate the system eigenvalues. The system eigenvalues are received by the attack agent and defense agent to train the optimal policy. The attack agent and defense agent have opposite objectives, i.e., to destabilize and to stabilize the system. Then the droop gains in the attack and defense sets are updated in the next time interval. Next, we need to transform the attack and defense problems into MDP and AMDP forms so that they can be handled by DRL methods. In this paper, TD3 is adopted to train DNNs to learn the optimal attack and defense policies, which aim to find the optimal combination of droop gain changes within the attack set and defense set. The deterministic policy can give a smoother output than a stochastic policy in the studied problem. TD3, as an extension of DDPG, addresses the sub-optimal policies caused by the value-function overestimation of DDPG. More details of TD3 can be found in its fundamental work [28].

Fig. 4. The training and implementation framework of the proposed method. TD3 is used to accomplish the attack and defense tasks in inverter-based systems.
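The core of the TD3 critic update described above, clipped double-Q learning with target-policy smoothing, can be sketched as follows; `q1_t`, `q2_t`, and `pi_t` stand in for the target networks, and the hyperparameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_target(r, s_next, q1_t, q2_t, pi_t, gamma=0.99,
               sigma=0.2, noise_clip=0.5, a_low=-1.0, a_high=1.0):
    """TD3 critic target for one transition: the next action comes from
    the target policy plus clipped noise (target-policy smoothing), and
    the smaller of the two target critics is used to curb the value
    overestimation that affects DDPG."""
    noise = np.clip(sigma * rng.standard_normal(), -noise_clip, noise_clip)
    a_next = np.clip(pi_t(s_next) + noise, a_low, a_high)
    return r + gamma * min(q1_t(s_next, a_next), q2_t(s_next, a_next))
```

Both critics are regressed toward this single target, and the actor (and target networks) are updated only every few critic steps, the "delayed" part of TD3.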

A. Dynamic LEA by Deep Reinforcement Learning
In this paper, the attack problem is formulated as an MDP defined by the tuple (S, A, P, R), where S is the set of states of the environment, A is the set of actions, P is the state transition probability function, and R is the set of immediate rewards. The attack agent learns an optimal control policy through interaction with the environment. It is noted that only the relevant states of the environment are received by the attack agent.
At each time step t, the attack agent receives a state s_t ∈ S from the current state of the environment. Then the agent generates an action a_t ∈ A, which drives the environment into a new state s_t+1. The action is generated by a policy π : S → A such that a_t = π(s_t). At each time step, the agent also receives a reward r_t ∈ R, which is a function of the state and the action, i.e., r : S × A → R. The transition between environment states can be modelled by the transition probability function P(s_t, a_t, σ), where σ represents the uncertainty in the environment.
The goal of the attack agent is to learn an optimal policy π* that maximizes the accumulated expected discounted reward J(π) = E[Σ_{t=0}^{T} γ^t r_t], where T is the episode length and γ ∈ [0, 1] is a discount factor. To estimate the expected discounted reward of taking action a_t following policy π in state s_t, the Q-value function is defined as Q^π(s_t, a_t) = E_π[J(π) | s_0 = s_t, a_0 = a_t]. In actor-critic structured DRL methods such as TD3, the Q-value function is estimated by one or two DNNs as Q_k(θ_c,k | s_t, a_t). Besides, the actor is also based on a DNN to generate the deterministic policy a_t = π(θ_a | s_t). θ_c,k and θ_a are the parameters of the DNNs for the critics and the actor. In this paper, TD3 is used to train the DNNs to solve the formulated MDP.
1) State: In the studied dynamic LEA problem, the agent receives certain states from the environment. At time step t, the measured state s_t includes the error to the attacker's targeted spectral abscissa,

Δη_att = |η(A_sys + A_att) − η*_att|

which can be calculated from the small-signal model.
2) Action: The actions generated by the agent are defined as the droop gain changes in the attack set, aiming to find the least effort attack path to system instability. Therefore, the action set of the attack agent is defined as

a_t = {m_att,i,t, n_att,i,t}, i ∈ V_att

where m_att,i,t, n_att,i,t refer to the droop gains to be changed in the attack set. The total droop gains should be within the limits [m̲_i, m̄_i] and [n̲_i, n̄_i], as defined in (38), (39).
3) State Transition: The system state transition is governed by s_t+1 = P(s_t, a_t, σ_t), which is determined jointly by the current state s_t, the agent action a_t, and the environment uncertainty σ_t. σ_t refers to the system uncertainty, e.g., the droop gain parameters and the RL values of loads. The agent gradually learns these characteristics from the data sources of the environment.
4) Reward: The reward function r_t is used to evaluate the performance of action a_t at state s_t. The reward function is designed such that the attack problem in (35)-(39) can be solved in the stochastic environment. Thus the reward can be defined as

r_t = r_1,t + r_2,t
r_1,t = −Σ_{i∈V_att} (|m_att,i,t| + |n_att,i,t|)
r_2,t = −|η(A_sys + A_att) − η*_att|

The reward function contains two parts, r_1,t and r_2,t. The first part aims to find the minimal sum of m_att,i and n_att,i in the attack set, as defined in (35). The second part is to shift the spectral abscissa with non-zero imaginary part to a desired positive value η*_att, as defined in (37).
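The two-part reward above can be sketched as a plain function; the weights `w1`, `w2` are assumed for illustration and are not specified in the paper.

```python
def attack_reward(dm, dn, eta_osc, eta_target, w1=1.0, w2=10.0):
    """Illustrative attack reward r_t = r1 + r2: r1 penalizes the total
    droop-gain change (least effort), r2 penalizes the distance of the
    oscillatory spectral abscissa from the attacker's target eta*_att.
    Weights w1, w2 are assumptions."""
    r1 = -w1 * (sum(abs(x) for x in dm) + sum(abs(x) for x in dn))
    r2 = -w2 * abs(eta_osc - eta_target)
    return r1 + r2
```

With this shaping, the reward is maximal (zero) only when the spectral abscissa sits at the target with no gain change, so the agent trades attack effort against destabilization accuracy.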

B. Robust Defense by Adversarial Reinforcement Learning
Given an agent with attack policy π, we wish to learn a policy π′ to defend against such an attack. This can be achieved by solving the AMDP defined by the tuple (S′, A′, P′, R′). Similar to the attack problem as an MDP, S′ is the set of system states including the attack agent, A′ is the set of all available defense actions, P′ : S′ × A′ × A → S′ is the transition probability under attack policy π and defense policy π′, and R′ is the reward function of the defense agent, which can be chosen as the opposite of the attack agent's reward. The defense agent seeks to solve the minimax problem min_π′ max_π J′(π′, π). As the iterative training of the attack policy π and the defense policy π′ converges slowly and does not provide greater robustness [29], here we adopt the alternative of fixing the attack policy π when training the defense policy a′_t = π′(θ′_a | s_t).
Algorithm 1 Adversarial Training
1: Import the trained attack agent with policy $\pi' \sim \theta_{a'}$.
2: Initialize the defense agent with randomized actor network $\pi \sim \theta_a$ and critic networks $Q_1 \sim \theta_{c,1}$, $Q_2 \sim \theta_{c,2}$. The target networks are of the same size.
3: Set the TD3 training hyperparameters as in Table II.
4: for episode = 1 to M do
5:   Initialize state $s_1$ and droop gains $m_i$, $n_i$ within a range.
6:   for t = 1 to T do
7:     Determine action $a_t$ by policy $\pi(\theta_a | s_t)$.
8:     Take action $a_t$, get reward $r_t$, and observe the next state $s_{t+1}$.
9:     Store the transition $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer $\mathcal{R}$.
10:  end for
11:  Randomly sample a mini-batch of m transitions from $\mathcal{R}$.
12:  Update the actor and critic network parameters with the TD3 policy gradient.
13: end for

1) State: The defense agent receives similar states from the environment. At time step t, the measured state includes $\epsilon_{def} = |\alpha(A_{sys,t} + A_{att,t} + A_{def,t}) - \alpha^*_{def}|$, the error with respect to the defender's targeted spectral abscissa, where $\alpha(\cdot)$ denotes the spectral abscissa.
2) Action: In this paper, the actions generated by the defense agent are defined as the droop gains in the defense set. The agent aims to find a policy that stabilizes the system. The action set of the defense agent is defined as $a_t = \{m_{def,i,t}, n_{def,i,t} \mid i \in \mathcal{V}_{def}\}$, where $m_{def,i,t}$, $n_{def,i,t}$ refer to the droop gains to be changed in the defense set. The total droop gains should remain within the limits $[\underline{m}_i, \overline{m}_i]$ and $[\underline{n}_i, \overline{n}_i]$, as given in (43), (44).
3) State Transition: The system state transition is governed by $s_{t+1} = P(s_t, a_t, a'_t, \sigma_t)$, which is determined jointly by the current state $s_t$, the defense action $a_t$, the attack action $a'_t$, and the environment uncertainty $\sigma_t$.
4) Reward: The reward function $r_t$ evaluates the performance of action $a_t$ at state $s_t$. It is designed so that the defense problem in (40)-(44) can be solved. The reward of the defense agent again contains two parts, $r_{1,t}$ and $r_{2,t}$, mirroring the attack reward with the defender's targeted spectral abscissa $\alpha^*_{def}$.
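The adversarial training loop of Algorithm 1, in which only the defender is updated against a frozen attack policy, can be sketched as the following skeleton; the environment, replay buffer, and TD3 update are placeholders for the actual networks and hyperparameters of Table II.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal FIFO replay buffer for sampled transitions."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)
    def store(self, transition):
        self.buf.append(transition)
    def sample(self, m):
        return random.sample(self.buf, min(m, len(self.buf)))

def adversarial_training(env, attack_policy, defense_agent,
                         episodes=5, horizon=10, batch_size=4):
    """Train the defense agent against a *fixed* attack policy:
    the attack action enters the environment transition, but only
    the defender's actor/critic networks are updated."""
    buffer = ReplayBuffer()
    for _ in range(episodes):
        s = env.reset()                   # random initial droop gains
        for _ in range(horizon):
            a_def = defense_agent.act(s)  # defense action from actor net
            a_att = attack_policy(s)      # frozen attacker
            s_next, r = env.step(a_att, a_def)
            buffer.store((s, a_def, r, s_next))
            s = s_next
        batch = buffer.sample(batch_size)
        defense_agent.update(batch)       # stand-in for the TD3 update
    return defense_agent
```

Any concrete environment exposing `reset()` and `step(a_att, a_def)` and any agent exposing `act`/`update` can be plugged into this loop.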

1) Test Systems: To evaluate the performance of the proposed method, two test systems are considered: a 4-IBR microgrid and the IEEE 123-bus system with 10 IBRs. The structure of the 4-IBR microgrid is shown in Fig. 1. In this system, 4 IBRs are connected to the microgrid via inverters, and an RL load is connected at each of the 4 buses. The parameters of the system are given in Table I. The performance of the proposed method under a large-scale system is evaluated with the IEEE 123-bus system.
2) Training Setup: The TD3 algorithm is used to solve the MDP and AMDP and to train the attack and defense agents. The TD3 hyperparameters used for training are presented in Table II. The time interval between two consecutive steps is $\Delta t = 0.1$ s. Training is performed on a laptop with a 3.00 GHz Intel i7-1185G7 CPU and 16 GB RAM. The DNNs, which include an actor network and double critic networks, are initialized with random weights and biases. All actor and critic networks have two hidden layers with 100 and 50 units; the numbers of neurons at the input and output layers vary with the specific problem. The ReLU activation function is used for all hidden layers of the actor and critic networks, and the Tanh activation function is applied to the output of the actor network.
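The actor architecture described above (two hidden layers of 100 and 50 units, ReLU hidden activations, Tanh output) can be sketched as a plain numpy forward pass; the layer sizes follow the text, while the weight initialization, input/output dimensions, and exploration-noise scale are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_actor(n_state, n_action, h1=100, h2=50):
    """Random initialization of the actor weights and biases."""
    return {"W1": rng.normal(0, 0.1, (n_state, h1)), "b1": np.zeros(h1),
            "W2": rng.normal(0, 0.1, (h1, h2)),      "b2": np.zeros(h2),
            "W3": rng.normal(0, 0.1, (h2, n_action)), "b3": np.zeros(n_action)}

def actor_forward(p, s):
    """ReLU on both hidden layers, Tanh on the output layer so the raw
    action lies in [-1, 1] before rescaling to the droop-gain limits."""
    h = np.maximum(0.0, s @ p["W1"] + p["b1"])
    h = np.maximum(0.0, h @ p["W2"] + p["b2"])
    return np.tanh(h @ p["W3"] + p["b3"])

def explore(action, noise_scale=0.1):
    """Gaussian exploration noise, clipped back to the Tanh range."""
    return np.clip(action + rng.normal(0.0, noise_scale, action.shape),
                   -1.0, 1.0)

p = init_actor(n_state=6, n_action=4)
a = explore(actor_forward(p, rng.normal(size=6)))
```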
3) Hyperparameter Selection: Typical values are selected as in Table II to ensure adequate training performance. Some guidelines for hyperparameter selection are as follows. The experience buffer stores past transitions for training; a large buffer stores more diverse transitions but requires more memory. The batch size determines the number of transitions used in each training iteration; a large batch size improves the accuracy of the computed gradients but requires more computational resources. The discount factor determines the importance of future rewards in the expected return; a large discount factor emphasizes long-term rewards, while a small one emphasizes short-term rewards. The learning rate determines how quickly the model updates its weights; a large learning rate may cause instability, while a small one may result in slow convergence. The exploration noise is added to the actions produced by the actor network to encourage the agent to explore new actions and states; its scale should balance exploration against exploitation.

After training is completed, the actor network can be extracted and used to find the dynamic LEA in a real-time environment. The system is assumed to operate at $m_i = 2.5 \times 10^{-4}$ and $n_i = 2.5 \times 10^{-4}$ when the simulation starts. The droop gains of IBR-2 and IBR-4 and the spectral abscissa of the system are shown in Fig. 6. As shown in Fig. 7 (a) and (b), the system operates stably under the initial droop gains. However, the inclusion of the attack agent leads to system instability in Fig. 7 (c) and (d). The time-domain simulation with the high-fidelity system validates that the attack targeting the maximum eigenvalue eventually destabilizes the system.
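The destabilization mechanism observed in the time-domain results can be illustrated with a toy linear simulation: once the attack pushes the spectral abscissa above zero, the linearized trajectories grow without bound. The 2x2 matrices and step size below are illustrative, not the microgrid model.

```python
import numpy as np

def simulate_norm(A, x0, dt=0.01, steps=2000):
    """Forward-Euler rollout of dx/dt = A x; returns the final state norm."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x + dt * (A @ x)
    return float(np.linalg.norm(x))

# Spectral abscissa -1 (stable) vs. +0.5 (destabilized by the attack).
A_stable = np.array([[-1.0, 2.0], [-2.0, -1.0]])
A_attacked = np.array([[0.5, 2.0], [-2.0, 0.5]])
x0 = [0.1, 0.0]
```

Rolling out both systems shows a decaying oscillation for the stable matrix and a growing one for the attacked matrix, mirroring the stable and unstable trajectories of Fig. 7.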

C. Case 2: Robust Defense Strategy
In the second case, the defense agent is added to the system of the basic case. The defense agent can change the droop gains of IBR-2 and IBR-3; the defense set contains m_def,2, n_def,2, m_def,3, and n_def,3 and intersects the attack set. After adversarial training with the attack agent, the actor network of the defense agent can be deployed to defend against such an attack. The droop gain changes from the attack and defense agents, as well as the spectral abscissa of the system, are shown in Fig. 8. In Fig. 8 (a) and (b), it can be seen that certain droop gains (e.g., those of IBR-4) change faster than others, and that the defense droop gains change more slowly than the attack droop gains. The underlying reason is that the defense droop gains are more sensitive to the system eigenvalues, i.e., the spectral abscissa, than the attack droop gains; the defense agent can therefore change them slowly to counter the rapid droop changes made by the attack agent. The attack and defense agents gradually reach an equilibrium, and the attack agent also changes its output compared to Case 1. After adversarial training against the attack agent, the defense agent is capable of bringing the system spectral abscissa to −2, making the system stable.
The system eigenvalues before the attack, after the attack, and with the defense are shown in Fig. 9. It can be seen that the critical eigenvalue shifts to the right half-plane after the attack: its real part changes from −5.4 to 2. After the defense agent intervenes, the real part shifts back to −2. During this shift, the proposed method keeps the change of the droop gains minimal.

D. Case 3: Scalability Test in IEEE 123-Bus System
In the third case, the scalability of the proposed attack and defense framework is tested in a modified IEEE 123-bus system, the details of which can be found in [25]. The system topology is shown in Fig. 10. The IBRs are located at buses {95, 149, 79, 5, 102, 112, 81, 91, 89, 47}. The IBRs at buses {95, 149, 79, 5} are under attack, while the IBRs at buses {95, 149, 102, 112} are under defense. The droop gain changes by the attack and defense agents in p.u. are shown in Fig. 11 (a) and (b). Since the droop gains in the defense set are more sensitive to system stability than those in the attack set, the defense agent can find a dynamic combination of droop gains that ensures the system's small-signal stability. Moreover, as shown in Fig. 11 (c), the defense agent is capable of maintaining the system spectral abscissa at −2.
The trace of the eigenvalues during the simulation is shown in the 2D plot of Fig. 12, and the trace of the critical eigenvalues in the 3D plot of Fig. 13. As shown in Fig. 12, the eigenvalues with the largest real part are shifted from the red star point to the green star point; the system spectral abscissa changes from −5.38274 to −2.06296, which aligns with the time-domain results in Fig. 11 (c). The system critical eigenvalues are taken as the eigenvalues with the largest real part. As shown in Fig. 13, λ19, λ20 are the system critical eigenvalues (blue circle) during 0 s–1.9 s, and λ33, λ34 are the system critical eigenvalues (red asterisk) during 2 s–20 s. This result also aligns with the previous findings.

VI. CONCLUSION
In this paper, the problem of destabilizing attacks on droop gains in inverter-based microgrids is studied, and a data-driven destabilizing attack and robust defense strategy are proposed. Firstly, the full-order model and the linearized reduced-order small-signal model of typical multi-inverter systems are derived. Then the destabilizing attack on the droop gains and its defense strategy are analyzed. Finally, a deep reinforcement learning approach based on TD3 is proposed to find the least-effort attack path of the system and obtain the robust defense strategy. The simulation results validate the effectiveness of the proposed method. It is found that the proposed method can determine the optimal combination of droop gains within the attack and defense sets, and that the system spectral abscissa can be shifted to the targeted position. The defense strategy obtained by adversarial training is robust against the destabilizing attack. The test on the IEEE 123-bus system validates the scalability of the proposed approach.

Fig. 2 .
Fig. 2. (a) The stability region of the small-signal model with respect to the frequency and voltage droop gains $m_i$ and $n_i$. (b) The stability region of IBR-4 with respect to the change of the frequency droop gain $m_3$ of IBR-3. The region to the left of the curve indicates that the system is stable.
(b). ii) The attacker can only manipulate either the frequency droop gain $m_i$ or the voltage droop gain $n_i$. This condition arises when the frequency and voltage droop gains use different communication channels and are dispatched separately. It can be illustrated by Fig. 2 (a), where the attacker can only change $m_i$ or $n_i$. If the attacker manipulates $m_i$ of a certain IBR, the defender can

Fig. 3 .
Fig. 3. The eigenloci of the small-signal model with respect to the change of $m_i$. The range of $m_i$ is from $5 \times 10^{-5}$ to $1 \times 10^{-3}$. The other parameters are fixed at their original values in Table I as $m_i$ changes.
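An eigenloci sweep of the kind shown in Fig. 3 can be sketched by varying one parameter of a small state matrix and recording the eigenvalues at each value; the 2x2 matrix and parameter range below are illustrative stand-ins for the microgrid model and the droop gain $m_i$.

```python
import numpy as np

def eigenloci(build_A, values):
    """Eigenvalues of build_A(p) for each swept parameter value p."""
    return [np.linalg.eigvals(build_A(p)) for p in values]

# Toy stand-in: increasing the droop-like gain k moves eigenvalues rightward,
# eventually crossing into the right half-plane (loss of small-signal stability).
build_A = lambda k: np.array([[-2.0 + k, 1.0], [0.0, -3.0]])
loci = eigenloci(build_A, np.linspace(0.0, 3.0, 7))
abscissas = [float(max(ev.real)) for ev in loci]
```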

Fig. 5 .
Fig. 5.The average reward and episode reward during the training stage of the attack agent.

Fig. 7 .
Fig. 7. The system frequency and voltage trajectories with and without the attack agent; the attacker's targeted spectral abscissa is 1.

Fig. 8 .
Fig. 8.The actions of droop gain changes from attack and defense agents as well as spectral abscissa of the microgrid.

Fig. 9 .
Fig. 9. The system eigenvalues before the attack, after the attack, and with both attack and defense.

Fig. 11 .
Fig. 11. The droop gain changes by attack and defense agents as well as spectral abscissa of the microgrid.

Fig. 12 .
Fig. 12. The trace of system eigenvalues during the simulation in Case 3. The system spectral abscissa is changed from −5.38274 to −2.06296.

Fig. 13 .
Fig. 13. The trace of system critical eigenvalues during the simulation in Case 3. The blue circle is for λ19, λ20; the red asterisk is for λ33, λ34.
are the upper and lower boundaries of the frequency, real power, voltage, and reactive power of the ith inverter. The inverter output voltages $v_{di}$, $v_{qi}$ are regulated to the reference values $v^*_{di}$, $v^*_{qi}$ determined by the droop controller. The voltage control loop is as follows:

TABLE I
PARAMETERS OF A MICROGRID SYSTEM WITH 4 IBRS

$$\min_{m_{def,i},\, n_{def,i}} \sum_{t\in T}\sum_{i\in \mathcal{V}_{def}} |m_{def,i,t}| + |n_{def,i,t}| \qquad (40)$$
$$\text{s.t.} \quad \det\!\left(A_{sys,t} + A_{att,t} + A_{def,t} - \lambda_i I\right) = 0 \qquad (41)$$
$$\alpha\!\left(A_{sys,t} + A_{att,t} + A_{def,t}\right) < 0 \qquad (42)$$

The reward function contains two parts, $r_{1,t}$ and $r_{2,t}$: it seeks the minimal change of $m_{def,i}$ and $n_{def,i}$ in the defense set while shifting the spectral abscissa to a desired negative value $\alpha^*_{def}$, as given in (40) and (42). The above AMDP can be solved with the TD3 algorithm through adversarial training of the attack agent with the defense agent, as summarized in Algorithm 1.