Reliability Analysis of the Control and Automation System in Electrical Substations

This work consists of a detailed reliability analysis of the control, automation, and communication equipment in an IEC 61850-based digital electrical substation, considering four different communication architectures. Three preventive maintenance plans are defined according to Birnbaum's importance measure in order to evaluate whether more detailed planning proves beneficial. The substation's reliability and availability analysis is performed with a Markov-Monte Carlo algorithm. Moreover, a quantitative cost analysis over the substation's life cycle is conducted to assess the economic feasibility of each architecture, allowing more objective decisions to be taken when considering the reliability requirements of electric power systems. This paper is an extract from the academic research conducted by the first author as a requirement to obtain the Master of Science Degree in Electrical and Computer Engineering at Instituto Superior Técnico, Universidade de Lisboa, under the supervision of Prof. Paulo J. Costa Branco and the co-supervision of Eng. Andrés A. Zúñiga.


Acknowledgments
I would like to start by thanking Professor Paulo José da Costa Branco, for the innovative ideas and suggestions he provided me, and Andrés Alejandro Zúñiga Rodríguez, for being tireless throughout this journey, for the motivation he showed and especially for believing in me.
I also thank Gonçalo Marques Silva, from EDP Labelec. His knowledge and spirit of mutual help were fundamental for this work to be as complete as possible.
Finally, I would like to thank my family for the unconditional support they always gave me, especially Bruna, for everything she did for me.

Context
In electric power systems, substations are a crucial element in the transmission and distribution of electrical energy. These infrastructures are mainly responsible for transforming electrical energy by stepping the voltage up or down depending on the power system's requirements, whether to deliver energy to the electric grid or to supply the customers. Hence, high voltage switching equipment and power transformers are necessary, as well as instrument transformers. These make it possible for the secondary equipment, responsible for protection, control and metering, to measure the voltages and currents in the primary equipment, such as lines, busbars and power transformers.
Until a few decades ago, the control of electric substations was performed by systems consisting of discrete electronic or electromechanical elements, where several functions were carried out separately by specific subsystems. Despite the considerably high reliability of those arrangements, they are also quite expensive, as they require a large investment in wiring, panels and construction, along with regular human intervention [1]. These inconveniences, allied with the need to monitor consumption, prices and services, were seen as an opportunity to modernize the electric power system, which marked the beginning of the smart grid vision. Smart grids are intended to take advantage of all available modern technologies to transform the current grid into one that functions more intelligently and autonomously.
This means that smart grids must meet the challenges of 21st-century needs, such as the following [2]:
• Increasing complexity;
• Growing demand;
• Renewable sources;
• Environmental and energy sustainability concerns;
• Electric energy storage;
• Greater grid reliability, security and efficiency.
Nowadays, the smart grid vision has gained momentum due to policy and regulatory initiatives, and hence numerous and diverse stakeholders are striving to realize the smart grid goals by continuing to deploy various new technologies. Substation Automation Systems (SAS) provide the automation functions for monitoring, protection and control within a substation, building on recent improvements in the fields of electronics and communication technologies.
The application in SAS of IEC 61850, an international standard of the International Electrotechnical Commission defining communication protocols for Intelligent Electronic Devices (IEDs) within electrical substations, brings significant changes to the instrumentation, monitoring, communication, control and protection systems, providing more flexibility and better performance of the SAS architectures [3].
The main objective of this standard is to facilitate interoperability, enabling logical configuration of the SAS by connecting various types of equipment from different manufacturers or different generations of equipment, even from the same manufacturer, through a Local Area Network (LAN) [4].

Problem definition
The primary function of an electric power system is to continuously provide electrical energy to all its customers, complying with all quality standards imposed by the energy regulatory agencies. In fact, modern society has come to expect that electricity will be continuously available on demand. Although there has been a steady decrease in the number of interruptions in energy supply over the years, a completely uninterrupted supply is not possible due to random failures in the system, which, for the most part, are not under the control of the companies responsible for the management of the power system. The electricity supply system is an extremely complex and highly integrated system, where the failure of a single element can cause interruptions. The effects of these interruptions range from inconveniencing a small number of residents close to the failure origin to major and widespread disruptions of supply, which can cause catastrophic events [5]. Due to this failure scenario, many reliability assessment approaches have been proposed; however, most of these focus exclusively on either the primary equipment or the secondary equipment separately. In addition, assessing the impact of failures of secondary equipment on the power system is extremely complex. Therefore, new approaches to assess these impacts continue to emerge.
According to the Third CIGRÉ Report on the Reliability of Circuit Breakers [6], which provides data on circuit breaker failures and defects in service collected over a period of four years (2004-2007), electrical control and auxiliary circuits (e.g., tripping and closing circuits, auxiliary switches and associated drives, contactors, relays) are responsible for 27.8% of minor failures (MiF) and 30.0% of major failures (MaF), compared with components at service voltage (e.g., breaking units, auxiliary interrupters) and the operating mechanism (e.g., compressors, motors). A major failure results in an immediate change in the system operating conditions or in mandatory removal from service for unscheduled maintenance. Minor failures are simply equipment failures whose impact is not sufficient for them to be considered major failures. This report also indicates that the influence of the electrical control and auxiliary circuits on circuit breaker faults is increasing over the years.
Reference [7] presents a sample distribution of substation components, including the control equipment, within the overall system fault distribution for the Nordic region during the years 2004 to 2013, as can be seen in Figure 1.1. From the data shown, it is possible to conclude that failures in the substations' control equipment are the most frequent (representing 48% of all substation faults) and, depending on the severity of those failures, may be responsible for approximately 15% of power system faults.

Motivation
Advances in electronics, information and communication technology are increasingly used in electrical substations to meet the demands of modern power systems. In turn, the increasing complexity of new electrical substations implies a deeper study of the reliability of protection, control and automation equipment, as well as a more rigorous coordination between them. In addition, the Portuguese power utility sector has been evolving in the past few years, as evidenced by increasing power quality [8].
A key element is the constant innovation in the protection, command and control systems present in substations [9].
Thus, the underlying motivation of this dissertation is to study the existing technologies developed for the automation of substations, since these enable more reliable and efficient monitoring, operation, control and protection, as well as an increase in operator safety and quality of service.

Goals
The purpose of this work is to assess the reliability of electric substations considering the new technologies and equipment associated with control, automation and communication based on the IEC 61850 standard. The analysis made in this report includes a survey of existing IEC 61850-based equipment used in electrical substations in order to establish its operational context and reliability modeling. Then, the impact of failures of this equipment on the operation of the substation is evaluated through a detailed analysis, which encompasses the reliability and availability of the substation and an economic evaluation of the costs incurred throughout its lifetime. This is performed considering four different communication architectures, responsible for the connection of all equipment under study. The goal of this work is to determine which architecture guarantees the best reliability and implies the lowest possible expenses.
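As a rough illustration of the Markov-Monte Carlo approach mentioned above (a minimal sketch, not the actual substation model developed in this work), the code below simulates a single repairable component that alternates between an up state and a down state with exponentially distributed dwell times, and compares the simulated availability with the analytic two-state value μ/(λ+μ). The failure and repair rates are illustrative placeholders, not substation data.

```python
import random

def simulate_availability(lam, mu, horizon, n_runs=2000, seed=1):
    """Monte Carlo estimate of the availability of a two-state
    (up/down) Markov model over a mission time `horizon` (hours),
    with failure rate `lam` and repair rate `mu` (per hour)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        t, up_time, up = 0.0, 0.0, True
        while t < horizon:
            # Dwell time in the current state is exponential with that
            # state's exit rate; truncate at the end of the horizon.
            dwell = min(rng.expovariate(lam if up else mu), horizon - t)
            if up:
                up_time += dwell
            t += dwell
            up = not up
        total += up_time / horizon
    return total / n_runs

lam, mu = 1e-3, 1e-1                       # illustrative rates only
estimate = simulate_availability(lam, mu, horizon=10_000.0)
analytic = mu / (lam + mu)                 # steady-state value, ~0.9901
```

The same alternating-renewal simulation generalizes to multi-state Markov models by sampling the next state from the transition rates, which is the idea behind the Markov-Monte Carlo analysis performed in this work.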
In addition, a similar analysis considering equipment degradation is carried out, in order to understand whether the difference between this approach and the one developed without degradation is significant. Furthermore, this analysis aims to challenge the model used in most reliability studies. Therefore, a new approach is proposed with the intention of being studied in greater detail in future work.

Document structure
The work developed in this dissertation is divided into two main sections, each comprising three chapters. First, a theoretical framework is introduced in order to describe the context, essential concepts and methods used. In the second main section, the work developed, the results obtained, and the respective conclusions are presented.
In chapter 1, the problems that led to the development of the reliability analysis of this dissertation are introduced, as well as its motivation, the proposed objectives, and a brief description of the work developed.
Chapter 2 discusses the topic of digital substations, covering their main functions, requirements for current power system demands, and their technological evolution. In addition, the IEC 61850 standard is introduced, from the history of its creation to its implementation in electrical substations. Some works that served as support for the execution of this dissertation are also mentioned.
Chapter 3 focuses primarily on the presentation of the concepts and methods used in this work, starting with the basic concepts on reliability, followed by Markov processes, and finally the Monte Carlo simulation method.
The methodology and the case study substation on which this work is based are presented in chapter 4. In this chapter, all approaches and considerations for the reliability and availability analysis and for the economic evaluation are described.
In chapter 5, the results obtained through the application of the methodology developed in the substation under study are presented and commented. Moreover, a proposed model that considers equipment degradation in the reliability analysis is suggested.
Finally, chapter 6 discusses the main conclusions regarding the results obtained. The contributions of this study to the development of reliability analysis methodologies are also presented. Furthermore, the limitations imposed on this work are discussed and improvement suggestions are also presented, as well as proposals for future work.
Background and State of the Art

Digital Substations
A digital substation is an electrical substation which is managed by distributed IEDs and other technologically advanced devices interconnected by communication networks [10]. Over the years, substation devices were developed to incorporate microprocessors, thus providing increased functionality and improving their accuracy and stability. Also, as substation communication capabilities improved, it became possible to connect the substation devices to Supervisory Control and Data Acquisition (SCADA) equipment. It also became possible for substation operators to access relevant information via user interfaces using specialized software, which could easily run on their personal computers. As computing power became greater and cheaper, fewer devices were required to carry out the same functions as before. Due to the integration of several functionalities and communications within the devices, the traditionally separate protection, monitoring, and control functions converged, forming the Substation Automation Systems known today [11].

Protection, Automation and Control
One of the primary functions of electrical substations is the protection of the power system. This function is responsible for identifying the existence of defects and for eliminating them by selectively shutting down the defective equipment. It is important to highlight that protection systems do not prevent defects, but rather act in response to them, isolating the faulty components from the rest of the system. Therefore, the protection systems, when properly configured, are limited to minimizing the consequences of those defects. The operation of protection systems can be divided into three functions, each one performed by different elements:
• Measurement - performed by instrument transformers and measuring equipment;
• Decision - performed by the protection equipment;
• Actuation - performed by the power cutting equipment.
Furthermore, protection, alongside load and generation variations in the power system, is one of the main reasons for the need for automation and control systems. A substation automation system is a collection of hardware and software components that are used to monitor and control an electrical system, both locally and remotely. It is also responsible for the automation of some repetitive, tedious, and error-prone activities, which is of the utmost importance for a continuous and overall increase in the availability of substations. The substation control system, in turn, is required to facilitate manual or automatic reconfiguration of the power system and data acquisition. It also allows the system to be reconfigured in response to changing conditions such as maintenance, unexpected changes in power demand, or an outage due to a fault.

Technological Evolution of Protection Systems
At the end of the 19th century, fuses made of lead, silver or similar metals were mainly used for the protection of the power system [12]. The fuse is a device which, by melting or fusing its conductive element, opens the circuit in which it is inserted and thus stops the current flow when the current exceeds a certain value for a certain amount of time. Although it performs well and is still used today in several applications, the fuse has limitations which make it unable to meet the required conditions for a complete protection device, since the only principle by which it operates is that of maximum current, and most fuses are single-use components. Thus, there was a need to introduce protection equipment that was more complete, selective, and easier to put back into operation.
The development of the instantaneous and delayed acting overcurrent protection devices, known as relays, took place around 1900. Relays may be classified according to four different groups, depending on the technology used: electromechanical, static, digital, and numerical.
The first generation of relays corresponds to the electromechanical relays, which work on the principle of a mechanical force operating a contact in response to a stimulus. The mechanical force is generated through current flow in one or more windings on a magnetic core. The main advantage of such relays is that they provide galvanic isolation between the inputs and outputs in a simple, inexpensive, and reliable form [13]. These relays are still used for simple on/off switching functions where the output contacts carry substantial currents.
Regarding static relays, their introduction began in the early 1960s. Their design is based on the use of analogue electronic devices instead of coils and magnets. Earlier versions used discrete devices such as transistors and diodes with resistors, capacitors, and inductors. However, advances in electronics enabled the use of linear and digital integrated circuits in later versions for signal processing and implementation of logic functions. Although basic circuits were common to several relays, each protection function had its own case, so complex functions required several cases of interconnected hardware. User programming was restricted to the basic adjustment of the relay's characteristic time-current curves. Therefore, static relays can be considered an analogue electronic replacement for electromechanical relays, with some additional flexibility in settings and some saving in space requirements. The term 'static' refers to the absence of moving parts to create the relay characteristic, excluding the output contacts [14].
Digital protection relays introduced a step change in technology. Microprocessors and microcontrollers replaced analogue circuits to implement relay functions. Compared to static relays, digital relays use analogue to digital conversion of all measured quantities and use a microprocessor to implement the protection algorithm. The first examples were introduced around 1980, with major improvements in processing capacity, a wider range of settings, greater accuracy, and a communication link to a remote computer.
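As a concrete, deliberately simple example of the kind of protection algorithm a digital relay implements in software, the sketch below evaluates the IEC 60255 'standard inverse' time-overcurrent characteristic, t = TMS · 0.14 / ((I/Is)^0.02 − 1); the current and setting values are illustrative, not taken from this work.

```python
def iec_standard_inverse(i, i_set, tms):
    """Operate time in seconds of the IEC 60255 'standard inverse'
    overcurrent characteristic, for measured current i (A), pickup
    setting i_set (A) and time multiplier setting tms."""
    m = i / i_set
    if m <= 1.0:
        # Below the pickup current the relay does not operate at all.
        raise ValueError("current below pickup setting")
    return tms * 0.14 / (m ** 0.02 - 1.0)

# At twice the pickup current with TMS = 0.1 the relay operates in
# roughly one second; at four times pickup it is considerably faster.
t_2x = iec_standard_inverse(i=1000.0, i_set=500.0, tms=0.1)
t_4x = iec_standard_inverse(i=2000.0, i_set=500.0, tms=0.1)
```

The inverse shape is what allows relays along a feeder to be time-graded: the higher the fault current seen by a relay, the faster it trips.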
Numerical protection relays are the natural technological evolution of digital relays, driven by the continuous reduction in microprocessors' cost and size and by increasing memory, which makes it possible for a single device to be responsible for a variety of functions. In fact, modern IEDs are numerical protection relays which encompass, in addition to protection functions, a variety of features related to:
• Automation - decision making without human intervention;
• Control - performing the necessary operations on the system (e.g., CB opening/closure);
• Monitoring - acquisition and processing of system data;
• Supervision - supervising the correct operation of the various system components;
• Human-Machine Interface - display of measured values and equipment state;
• Time Synchronization - disturbance records have to be correlated with similar records from other sources to obtain a complete picture of an event;
• Programmable Logic - simple and personalized equipment parameterization.
As summarized in Figure 2.1, the electromechanical relay has been replaced successively by static, digital and numerical relays, each change bringing with it reductions in size and improvements in functionality, reliability, and availability. This represents a tremendous achievement for all those involved in relay design and manufacture [13].
Despite the advantages brought by IEDs to the power system's protection, automation, and control, the introduction of these devices replaces some of the issues of previous generations of relays with new ones, such as software version control, data management, testing, and commissioning [14]. These issues must be continuously addressed by both manufacturers and customers to ensure that today's smart grid requirements are met.
Along with protection devices, the automation systems within substations have also undergone a significant transformation over the years. Since the first protection equipment did not have communication capabilities, status monitoring and substation control were made through hardwired connections to the RTU (Remote Terminal Unit) and/or central unit for logic execution. This system required a significant number of connections and a significant effort for event documentation.
As protection devices developed the ability to communicate with other substation components, serial protocol-based systems emerged, which no longer needed RTUs. However, the communication between components was conducted through proprietary protocols, meaning that protocol design, implementation, and technical information were restricted (no parameterization was possible by customers), and substation equipment and its functions were not standardized. Moreover, the integration of this automation system in substations required in-depth knowledge of each piece of equipment and protocol conversion, since not all equipment used the same communication protocol.
The publication of the IEC 61850 standard, which will be discussed in more detail in sub-chapter 2.2, allows the integration of system status monitoring in IEDs, along with a complete modelling of the substation, its equipment and functions, and interoperability through standardization and equipment certification. Then, by digitizing the signals previously carried over copper wires, digital and analogue signals could be replaced by a process communications network. This transformation corresponds to the latest generation of substation automation systems, which is currently associated with digital substations.

The IEC 61850 Standard
Communication has always played a critical role in the real-time operation of power systems [15].
Hence, in the 1970's, the first communication protocols and standards were introduced, being quickly adopted by the power industry to implement efficient control and automation systems [16]. Their main purpose was to ensure open access, interoperability, flexibility, and upgradability, as well as effective data sharing among different devices and applications.
With the evolution of the secondary equipment, new problems arose from the variety of manufacturers developing their own devices, which communicate using a wide spectrum of protocols. This resulted in an extremely complex problem for engineers, since many of the protocols used are not interoperable, meaning that the devices cannot communicate with each other natively. Furthermore, with the increasing integration of renewable energies in the electric power grid, which introduces a different set of manufacturers, protocols, and generation capable of disturbing the stability of grid-supplied electricity (e.g., the frequency of the power supplied), a common standard is of the utmost importance [17].
Additionally, the energy sector is geographically divided between two main standards organizations: the International Electrotechnical Commission (IEC) and the American National Standards Institute (ANSI). Very often, this division was an obstacle to the development of technologies in the field of power system automation. For that reason, IEC 61850 was issued in 2004 as a global standard for the control and protection systems in electrical substations, covering both the IEC and ANSI standardization models [18].

Legacy Protocols and Standards
Among the legacy protocols that preceded IEC 61850 is Modbus, including its Modbus TCP variant, which is simply the Modbus RTU protocol with a TCP interface that runs over Ethernet [19].
Regarding the DNP3 protocol, it is a telecommunication standard developed by a division of Harris (which is now owned by GE Energy) in 1993. It was designed to optimize the transmission of data acquisition information and control commands from one device to another, such as RTUs and IEDs, in a substation environment. The DNP3 protocol is very fast and efficient in communication networks, both serial and Ethernet. It also allows for some advanced features such as time synchronization and configuration file transfers [20]. Nowadays, it is used in the electrical, water infrastructure, oil and gas, security as well as other industries worldwide.
The IEC 60870 standard defines the communication process between protection equipment and the devices of the control system, including telecontrol (SCADA). In addition to a simple electrical connection, the standard also defines a direct communication interface implemented over fiber optic cable, suiting the requirements for communication between control centers and substations [21].

Description of the IEC 61850 Standard
The IEC 61850 standard was designed by leading SAS experts from around the world to simplify the process of automation within electric substations. It is important to highlight that it is not a communication protocol, but a standard that goes beyond describing how data is transferred and received. It defines how data is executed and stored, and it also covers device specifications, such as surge withstand, environmental conditions, electromagnetic interference and other factors [22]. This standard also allows high-speed Ethernet communication (over copper or fiber optic cable) in electrical substations and offers an internationally standardized IED configuration language and data model, providing [23]:
• Interoperability - the ability of two or more devices from the same vendor, or different vendors, to exchange information and use that information for correct execution of specified functions;
• Interchangeability - the ability to replace a device supplied by one manufacturer with a device supplied by another manufacturer without making changes to the other elements in the system;
• Redundancy - the existence of more than one means for performing a required function in a system or device.
Other major benefits associated with the implementation of the IEC 61850 standard, compared with legacy approaches such as Modbus TCP/IP or DNP3, are the following [24]:
• Free configuration - any possible number of substation protection and control functions can be integrated at bay level IEDs;
• Simpler architecture - numerous point-to-point copper wires are reduced to simple communication links, allied with a functional architecture that provides better communication.
Other operational challenges imposed by this standard's implementation, which need to be carefully addressed, are architecture, availability, maintainability, data integrity, testing, interchangeability, data security, and version upgrade requirements. Furthermore, there are some project challenges to overcome as well, such as cost and complexity, allocation of the substation functions, system expansion, and manpower training [14]. Still, the increasing availability of substations has shown that the overall benefits largely outweigh these implementation challenges.

IEC 61850 Architecture
A typical architecture of the IEC 61850 based SAS, which consists of three levels and two buses, is shown in Figure 2.2.
Connected directly to the primary equipment of the substation (switchyard), current and voltage transformers are placed at strategic points in order to acquire the current and voltage values of those specific zones, mainly for protection purposes. These values are processed at the process level by Merging Units (MUs) so that the bay level devices can read and treat them according to their parameterization. A common time reference (normally a GPS signal) is required in order to ensure that every merging unit located throughout the substation is synchronized [11]. The station level includes a gateway (such as a router), which enables remote access to the Network Control Center (NCC), and a station server, which provides the human-machine interface functionality and communication with the SCADA system [11]. After the data, representing events, alarms, status indications and digital/analog values, is sent to the NCC and archived, it is no longer necessary to keep it in RAM and it is overwritten by incoming new data [1]. At this level, the status data of the various components in the substation are available to operators for monitoring and operation purposes. Operators can also issue signals at this level to perform certain kinds of manual control [25].
Regarding the process and station buses, these are responsible for the complete information exchange between levels using network switches, which are required to connect multiple devices in a LAN. This communication is made efficiently, since data can be forwarded from one device to another without affecting the remaining network devices. The most popular network switch used today in substations is the Ethernet switch [1]. The purpose of the process bus is to enable the time-critical communication between the process level and the upper levels, as well as to build a bridge for voltage and current values going from the MUs to the protection IEDs, and for trip signals going in the opposite direction.
As for the station bus, it enables information exchange between the bay level and station level, making status data of the entire substation available to the SCADA system for monitoring and operation purposes [25].

Related Work
This dissertation is based on three different works developed in the same department at Instituto Superior Técnico, where different methods for evaluating the effects of failures and/or failure modes of electric power grid equipment on its operation are presented. Thus, in each work, it was possible to perform the intended system reliability analysis and to draw conclusions from the results obtained. Two other studies are also presented, where methods and approaches similar to the ones used in this dissertation are discussed.
In [26], the analysis of the effects of cyberattacks and of how the most critical failure modes of the components affect the reliability of smart grids is conducted using reliability block diagrams. First, the reliability of a traditional electrical distribution grid and its equipment was studied. Then, a communication system was modeled over the traditional grid, allowing better interoperability between the different equipment and also enabling the study of the impact of cyber-attacks on the grid. Results show that transformers are the most critical components of the power grid. The simulations carried out for the smart grid found that a more redundant communications network is a more prudent choice. Regarding cyber-attacks, it was concluded that if they occur in a local control center, the reliability of the network is only affected when they can control the opening of a circuit breaker without any redundancy, or of all those in parallel connections.
In [27], the application of Failure Modes and Effects Analysis (FMEA) method is introduced in future smart grid systems in order to establish the impact of different failure modes on smart grid performance.
A detailed analysis of all possible failure modes in both substation primary and secondary equipment and their effects on the system is presented. Preventive maintenance tasks are proposed and systematized to minimize the impact of high-risk failures and to increase the reliability of the proposed test system. It is concluded that SVs and transformers are the equipment with the most critical failure modes, meaning that their respective high-risk causes of failure compromise correct grid operation. Busbar failure modes are also identified as critical, in the sense that the impact of their failure on the grid is significant.
In [28], the study of the main failures that affect the components of conventional power grids is performed in a first stage. Also, a reliability analysis using the fault tree method to find which components and failure modes are more critical is conducted. In a second phase, cyber system failures are identified and applied to the fault trees built for the conventional grid, allowing the evaluation of the impacts on the overall system. With the development of this work, it was possible to conclude that the studied distribution system is reliable and that the most critical components are the 110 kV cables and the 220/110 kV transformers. It was also verified that the cyber components do not have a major impact on the overall system reliability.
In [6], a method is proposed to assess the reliability of digital relays, considering the fault data stored in the fault information system to compute the steady-state probabilities and state transition probabilities. Finally, in [29], a technique to quantitatively analyze the reliability of digital relays is proposed. The technique presents a methodology for the mathematical modeling of digital relays, where two failure modes are considered: fail-operation, where the digital relay does not act when there is a failure in the primary system (a dependability issue), and false-operation, where it operates when there is no failure in the system (a security issue). A protection availability index and a protection economy index are proposed and calculated using the state-space method. Results show that the best way to enforce relay reliability is to improve its self-checking ability.

Fundamentals
The consequences of a failure of a given system are studied within risk assessment, which normally encompasses the calculation of its probability of occurrence in time and the evaluation of the resulting costs, including equipment replacement, production losses, and social impact. The causes of a failure can fall into several categories: the system's design is inherently incapable of performing what is intended, the environment in which the system operated was beyond its capability, the system or any of its components was not assembled or constructed as the design guidelines required, or it fails due to wear-out.
The perception of failures suggests that they arise randomly in time. Based on this perception, it is possible to construct elaborate mathematical models that try to predict probabilistically when the next failure will occur. With this in mind, three important reliability concepts are described [30]:
• Failure, the incapacity to perform the required service;
• Failure mode, the effect by which a failure is observed on a failed item;
• Failure rate, a function of time that represents the rate at which failures occur.
The random variable $T$ is defined as the time to failure of the system, and it can be modelled using a probability density function (PDF), $f(t)$. As Equation (3.1) shows, the probability of a failure prior to the instant $t$ is the definition of the cumulative distribution function (CDF), $F(t)$:

$$F(t) = P(T \le t) = \int_0^t f(\tau)\,d\tau \quad (3.1)$$

The reliability of a system, $R(t)$, is defined as the probability of failure-free operation prior to the instant $t$. This definition, given by Equation (3.2), is the probability of success in terms of $F(t)$:

$$R(t) = P(T > t) = 1 - F(t) \quad (3.2)$$

Knowing that the system is operational at instant $t$, the probability of it failing at some time between $t$ and $t + dt$ is given by Equation (3.3) [31]:

$$P(t < T \le t + dt \mid T > t) = \frac{f(t)}{R(t)}\,dt = \lambda(t)\,dt \quad (3.3)$$

where $\lambda(t)$ is the failure rate function, defined in Equation (3.4):

$$\lambda(t) = \frac{f(t)}{R(t)} \quad (3.4)$$
A rough approximation of the behavior of the failure rate function with time regarding components and systems is given in Figure 3.1, which is known as the bathtub curve due to its characteristic shape.
In the burn-in period, the failure rate is often high, which can be explained by the fact that there may be latent manufacturing or assembly defects that surface early in operation. During the useful life period, the failure rate is approximately constant, $\lambda(t) = \lambda$, and the time to failure is exponentially distributed. Taking into account that the reliability of a system at $t = 0$ is assumed to be equal to 1, then

$$R(t) = e^{-\lambda t} \quad (3.13)$$

from which it is possible to conclude, according to Equation (3.9), that when the time to failure is exponentially distributed, the MTTF (mean time to failure) is constant and equal to the inverse of the failure rate $\lambda$, as expressed in Equation (3.14):

$$MTTF = \int_0^{\infty} R(t)\,dt = \frac{1}{\lambda} \quad (3.14)$$

Similarly, the MTTR (mean time to repair) is also constant and can be defined as the inverse of the repair rate, $\mu$, as shown in Equation (3.15); the repair rate is described as the number of repairs successfully performed in a given time interval:

$$MTTR = \frac{1}{\mu} \quad (3.15)$$

Finally, the instantaneous availability of a component, $A(t)$, is its ability to be kept in a functioning state at a specific time $t$. The long-run average availability, $A$, is the proportion of time that the system is operating [32] and is expressed in terms of MTTF and MTTR, or also in terms of $\lambda$ and $\mu$ when both failure and repair times are exponentially distributed, by Equation (3.16):

$$A = \frac{MTTF}{MTTF + MTTR} = \frac{\mu}{\lambda + \mu} \quad (3.16)$$

Many physical systems are composed of assemblies of many interacting components. These components are often arranged in physical series or parallel configurations for the system to function as intended. However, if one wants to assess the reliability of a particular system, it is important to understand the functional connection between its components, which, in many cases, may not be the same as their physical configuration. In order to comprehend the components' functional connections, it is necessary to evaluate how each component's reliability contributes to the success or failure of the system. Considering any operating physical system, if it is necessary for all components to be operational for the circuit to work as planned, then the circuit represents a series system from a reliability point of view. A possible representation of this type of system can be observed in Figure 3.3.
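As a quick numerical check of Equations (3.14)-(3.16), the sketch below computes the long-run availability both from MTTF/MTTR and directly from the rates. The rate values are purely illustrative, not taken from the paper's Table 4.1:

```python
# Long-run availability, Equation (3.16): A = MTTF/(MTTF + MTTR) = mu/(lambda + mu).
failure_rate = 0.02        # lambda, failures per year (illustrative)
repair_rate = 365.0        # mu, repairs per year (MTTR of about one day)

mttf = 1.0 / failure_rate  # Equation (3.14)
mttr = 1.0 / repair_rate   # Equation (3.15)

availability_times = mttf / (mttf + mttr)                        # via MTTF and MTTR
availability_rates = repair_rate / (failure_rate + repair_rate)  # via lambda and mu

# Both forms of Equation (3.16) agree.
assert abs(availability_times - availability_rates) < 1e-12
print(f"A = {availability_times:.6f}")
```

For realistic electronic-equipment rates, the two expressions are interchangeable; the rate form is convenient when, as later in this work, failure and repair are both modeled as exponential.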
The reliability of a series system is easily calculated from the reliability of its components. Let $Q_i$ be the probability that component $i$ fails. The probability of failure of a system with $n$ components in series, $Q_s$, is then:

$$Q_s = 1 - \prod_{i=1}^{n} (1 - Q_i)$$

so that the reliability of the series system is $R_s = \prod_{i=1}^{n} R_i$. Regarding parallel systems, represented in Figure 3.4, the system fails only if all its components fail. In these cases, the probability of failure of a system with $n$ components in parallel, $Q_p$, is:

$$Q_p = \prod_{i=1}^{n} Q_i$$

Therefore, the reliability of the parallel system, $R_p$, can be obtained from Equation (3.20):

$$R_p = 1 - Q_p = 1 - \prod_{i=1}^{n} Q_i \quad (3.20)$$
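The series and parallel formulas above can be sketched in a few lines of Python; the component reliability values are hypothetical:

```python
from math import prod

def series_reliability(component_reliabilities):
    # All components must work: R_s is the product of the R_i.
    return prod(component_reliabilities)

def parallel_reliability(component_reliabilities):
    # The system fails only if every component fails: R_p = 1 - prod(Q_i), Eq. (3.20).
    return 1.0 - prod(1.0 - r for r in component_reliabilities)

rs = [0.95, 0.99, 0.90]          # hypothetical component reliabilities
print(series_reliability(rs))    # 0.95 * 0.99 * 0.90 = 0.84645
print(parallel_reliability(rs))  # 1 - 0.05 * 0.01 * 0.10 = 0.99995
```

As expected, the same three components yield a much higher system reliability in parallel than in series, which is the motivation for the redundant communication architectures studied later.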

Birnbaum's Importance Measure
Birnbaum's importance measure, $I^B_i(t)$, quantifies the sensitivity of the system reliability to the reliability of component $i$, $I^B_i(t) = \partial R_s(t) / \partial R_i(t)$. Moreover, Birnbaum's critical importance, $I^{CR}_i(t)$, measures the probability that a specific component $i$ is the cause of a total system failure after a time $t$ [33]. Analytically, the critical importance is defined by Equation (3.22):

$$I^{CR}_i(t) = I^B_i(t)\,\frac{Q_i(t)}{Q_s(t)} \quad (3.22)$$

where $Q_i(t)$ and $Q_s(t)$ are the failure probabilities of component $i$ and of the system, respectively.
The critical importance of a component is a valuable tool for establishing priorities in maintenance actions and equipment replacement periods. In fact, due to these possibilities, component prioritization has become a crucial task in today's electrical market [34].
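To illustrate how these measures support component prioritization, the sketch below evaluates Birnbaum's measure and the critical importance of Equation (3.22) for a small hypothetical series-parallel system; the structure function and reliability values are assumptions for illustration, not the substation model:

```python
def system_reliability(r):
    # Hypothetical structure: components 0 and 1 in series,
    # followed by a parallel pair of components 2 and 3.
    parallel = 1.0 - (1.0 - r[2]) * (1.0 - r[3])
    return r[0] * r[1] * parallel

def birnbaum(r, i):
    # I_B(i) = dR_s/dR_i = R_s(R_i = 1) - R_s(R_i = 0) for coherent systems.
    hi, lo = list(r), list(r)
    hi[i], lo[i] = 1.0, 0.0
    return system_reliability(hi) - system_reliability(lo)

def critical_importance(r, i):
    # I_CR(i) = I_B(i) * Q_i / Q_s, Equation (3.22).
    q_sys = 1.0 - system_reliability(r)
    return birnbaum(r, i) * (1.0 - r[i]) / q_sys

r = [0.95, 0.97, 0.90, 0.90]   # hypothetical component reliabilities
for i in range(4):
    print(i, round(birnbaum(r, i), 4), round(critical_importance(r, i), 4))
```

The series components dominate both rankings here, which matches the intuition behind Plans 2 and 3 later in this work: components with high critical importance get shorter maintenance periods.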

Markov Processes
The reliability models of some systems are not easy to deal with, since the mathematical models may be complex to solve [35]. Nevertheless, a simple way to model such systems is by applying Markov models: a Markov model is a stochastic process described by the set of all possible states of the system and the transitions between these states [36]. Markov models can either be discrete, with probabilistic state transitions occurring at specific time steps, or continuous, characterized by constant-rate state transitions happening continuously in time.
In the field of reliability assessment, Markov modelling commonly refers to the continuous Markov process, where the state transitions are governed by the failure rates, $\lambda$, and the repair rates, $\mu$. The collection of all possible states, represented by $S$, is called the state space. In most applications, $S$ is finite and the states correspond to real states of a system. For each state $i$ there is an associated probability, $P_i(t)$, which is the probability of the system being in state $i$ at time $t$ [31].
Consider a system with two repairable components (A and B), where each component has only two possible states, operational (O) and failed (F), and an associated failure rate ($\lambda_A$, $\lambda_B$) and repair rate ($\mu_A$, $\mu_B$). The system will thus have four possible states, S1, S2, S3, and S4. Table 3.1 presents the states of the system in terms of the operational behavior of its components. It is assumed that the system can be described by the Markov process represented in Figure 3.5, where circles represent states, arrows represent state transitions, and the variables next to the arrows represent state transition rates.

Consider a time interval $\Delta t$ so small that the probability of more than one fault or repair occurring within it is negligible [35]. Then, since each state has an associated probability, the probability of a given state transition during $[t, t + \Delta t]$ can be written as the state transition equation:

$$P_{ij}(\Delta t) = a_{ij}\,\Delta t\,P_i(t) \quad (3.23)$$

where $P_{ij}$ represents the state transition between states $i$ and $j$, $a_{ij}$ is the transition rate, and $P_i(t)$ corresponds to the system's state probability before the transition.
The probability of the system being in the same state after a time interval $\Delta t$ is equal to the state probability at time $t$ times the probability of no state transition occurring, plus the probability of the system moving from a different state to the one under analysis during $\Delta t$. Considering state S1, and taking S2 as the state with component A failed and S3 as the state with component B failed, this probability can be expressed as in Equation (3.24):

$$P_1(t + \Delta t) = P_1(t)\,[1 - (\lambda_A + \lambda_B)\,\Delta t] + \mu_A\,\Delta t\,P_2(t) + \mu_B\,\Delta t\,P_3(t) \quad (3.24)$$

The negative term reflects the fact that the net rate of change of each state probability is the sum of the rates of transitioning in from external states minus the rate of transitioning out of the considered state.
As shown below, Markov processes can be solved by means of differential equations. Taking the limit $\Delta t \to 0$ of expressions such as Equation (3.24) yields, for each state $i$:

$$\frac{dP_i(t)}{dt} = \sum_{j \ne i} a_{ji}\,P_j(t) - \Big(\sum_{j \ne i} a_{ij}\Big)\,P_i(t) \quad (3.25)$$

These differential equations can be represented in matrix form as shown in Equation (3.26):

$$\frac{d\mathbf{P}(t)}{dt} = \mathbf{A}\,\mathbf{P}(t) \quad (3.26)$$

where $\mathbf{A}$ is the Markov matrix, also known as the transition matrix, which, in the context of a Markov process, describes the state transition rates of the process between all the possible states. The set of equations for the system under study is represented in Equation (3.27). Notice that the off-diagonal entries of $\mathbf{A}$ are nonnegative and the sum of the elements in each column is zero; these conditions must be satisfied by all transition matrices in continuous Markov processes [37]. In order to compute the reliability, $R(t)$, of the system, one needs to solve the set of equations to obtain its state probabilities, using any standard technique. Then, because S4 corresponds to the total failure state of the system, the reliability is the sum of the probabilities of the remaining states. However, since the sum of all state probabilities is equal to one, the reliability can be simply obtained by subtracting the failed-state probability, in this case $P_4(t)$, from one, as shown in Equation (3.28):

$$R(t) = 1 - P_4(t) \quad (3.28)$$

One of the most relevant advantages of a Markov process is that it allows a clearer perception of the changes in the states of the system under analysis, while other methods, like RBDs or fault trees, are conducted through a structure with no direct connection to the system's states. However, the complexity of the analysis increases exponentially with the number of components incorporated in the system and with the level of detail one wants to apply, since there will be more states to analyze. For simplicity, it is not always necessary to analyze all the system's states: very similar states, from an operational perspective, can be merged into a single state, which also demonstrates the method's flexibility.
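The full matrix of Equation (3.27) depends on how Table 3.1 assigns the one-component-failed states. Assuming S2 = {A failed, B operational} and S3 = {A operational, B failed}, and purely illustrative rates, a minimal numerical sketch of Equations (3.26)-(3.28) is:

```python
import numpy as np

# Illustrative rates for the two-component repairable system (per year).
lA, lB = 0.10, 0.20   # failure rates of components A and B
mA, mB = 50.0, 50.0   # repair rates of components A and B

# Transition matrix A, with off-diagonal entries nonnegative
# and each column summing to zero.
A = np.array([
    [-(lA + lB),        mA,         mB,        0.0],
    [ lA,        -(mA + lB),        0.0,       mB ],
    [ lB,               0.0, -(mB + lA),       mA ],
    [ 0.0,              lB,         lA, -(mA + mB)],
])
assert np.allclose(A.sum(axis=0), 0.0)

# Integrate dP/dt = A @ P (Equation (3.26)) with a simple Euler scheme.
P = np.array([1.0, 0.0, 0.0, 0.0])   # system starts in S1, both components OK
dt, horizon = 1e-3, 10.0
for _ in range(int(horizon / dt)):
    P = P + dt * (A @ P)

print("state probabilities:", P)       # components of P sum to ~1
print("R = 1 - P4 =", 1.0 - P[3])      # Equation (3.28)
```

Because the columns of the matrix sum to zero, the total probability is conserved by the integration, which is a useful sanity check when building larger transition matrices.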

Monte Carlo Simulation
The Monte Carlo method is the general designation for stochastic simulation using random variates and it can be used to solve not only stochastic but also deterministic problems. Applications of the Monte Carlo technique can be found in a wide variety of fields such as complex mathematical calculations, medical statistics, engineering system analysis, and reliability evaluation [5]. The fundamental idea behind this method is the development of an equivalent stochastic process that behaves as much as possible as the real system under analysis. Thereby, a large number of simulations are needed in order to cover as much operational scenarios as possible. The process is then observed, and the results are tabulated and treated as if they were experimental data representative of the actual system.
In Monte Carlo simulation, there is a wide range of random variate generation methods. A random variate is a realization of a random variable following a given distribution. Random variates uniformly distributed on [0, 1] are produced directly by a random number generator, whereas generators of random variates following non-uniform distributions build upon these uniformly distributed random numbers and require different methods depending on the target distribution. Therefore, it is important to know the behavior and characteristics of the system in order to decide which random variate generation method is appropriate and where to apply it in the simulation.
Since the constant failure rate model is considered in this work, the time to failure is represented by an exponential probability distribution, $F(t) = 1 - e^{-\lambda t}$. A random variate $t$ following this distribution can be obtained from a uniform random number $u$ through:

$$t = F^{-1}(u)$$

where $F^{-1}$ is the inverse distribution function (IDF), which is also a strictly increasing function. As $u$ is a uniformly distributed random variate:

$$t_f = -\frac{\ln(u)}{\lambda}$$

where, in the case of this study, $u$ is a uniformly distributed random number, $t_f$ is the time to failure of a certain component (following an exponential distribution), and $\lambda$ corresponds to the failure rate of the component. This is also applicable to the time to repair, $t_r$, where $\lambda$ is substituted by the repair rate, $\mu$.
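A minimal sketch of this inverse-transform sampling, with an illustrative rate, is shown below; the sample mean of the generated times to failure should approach the theoretical MTTF of $1/\lambda$:

```python
import math
import random

def exponential_variate(rate, rng=random.random):
    # Inverse transform: u uniform on [0, 1) gives t = -ln(1 - u) / rate,
    # which follows an exponential distribution with the given rate
    # (1 - u is used so the argument of the logarithm is never zero).
    return -math.log(1.0 - rng()) / rate

random.seed(42)
lam = 0.5  # illustrative failure rate (per year)
samples = [exponential_variate(lam) for _ in range(100_000)]
mean_ttf = sum(samples) / len(samples)
print(f"sample mean = {mean_ttf:.3f}, theoretical MTTF = {1.0 / lam:.3f}")
```

The same function generates repair times by passing the repair rate $\mu$ instead of $\lambda$.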
As it is possible to see, the randomness of the Monte Carlo simulation is applied to its most variable factors, which correspond to the failure and repair times of the components. This is understandable, since these times consider various relevant aspects which cannot be easily captured using analytical models, such as the quality of equipment manufacture, the environmental conditions in which the equipment is operating, the different effects of a component's fault, among others. Additionally, as seen in sub-section 3.1, these factors affect mostly the availability of systems, this being the main reason for choosing Monte Carlo simulation for its calculation.
The Monte Carlo procedure adopted in this work can be described by the following steps:
1. The period of simulation is divided into an equal number of time periods called trials;
2. At each trial, a set of random numbers, of size equal to the number of components considered, is generated and confronted against the corresponding probability of component failure or repair during the trial period:
• If the component is in operation at the start of the trial and the generated random number is lower than its failure probability, the component will be in a failed state at the end of the trial;
• If the component is in a failed state at the start of the trial and the generated random number is lower than its repair probability, the component will be in operation at the end of the trial;
3. The state of the system is determined at each trial depending on its components' states and logical connections;
4. The total operational time of the system is computed, and the simulation is then repeated $n$ times in order to obtain the average availability of the system, as given in Equation (3.37):

$$\bar{A} = \frac{1}{n} \sum_{i=1}^{n} A_i, \qquad A_i = \frac{\text{total operational time}}{\text{simulation time}} \quad (3.37)$$

where $A_i$ is the availability obtained in the $i$th repetition of the Monte Carlo procedure and $n$ corresponds to the number of simulations performed.
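The trial-based procedure above can be sketched for a single repairable component; with one component, the estimated availability should approach the analytical value $\mu/(\lambda+\mu)$ from Equation (3.16). The rates below are illustrative, not the paper's data:

```python
import math
import random

def simulate_availability(lam, mu, trials=100_000, dt=0.001, rng=random.random):
    """Trial-based Monte Carlo availability for one repairable component.

    lam, mu: failure and repair rates (per unit time); dt: trial length.
    A failure (or repair) occurs in a trial when the random number falls
    below the per-trial failure (or repair) probability.
    """
    p_fail = 1.0 - math.exp(-lam * dt)     # failure probability per trial
    p_repair = 1.0 - math.exp(-mu * dt)    # repair probability per trial
    operational = True
    up_trials = 0
    for _ in range(trials):
        u = rng()
        if operational:
            if u < p_fail:
                operational = False
        elif u < p_repair:
            operational = True
        up_trials += operational           # count trials spent operational
    return up_trials / trials              # A_i of Equation (3.37)

random.seed(1)
lam, mu = 1.0, 20.0   # illustrative rates
est = simulate_availability(lam, mu)
print(f"estimated A = {est:.4f}, analytical A = {mu / (lam + mu):.4f}")
```

In the actual study, this loop is run for all SAS components simultaneously, the system state at each trial is derived from the RBDs, and the whole simulation is repeated $n$ times to average the availability.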

The Electrical Substation
In this work, the reliability, availability, and operational costs are computed and analyzed for the substation whose single-line diagram is presented in the corresponding figure. From a reliability perspective, since the goal is to study the impact of failures of SAS equipment on the operation of the substation, this is a good case study to work with, as external (grid-related) failures do not severely impact load supply. This is due to the fact that there is redundancy in the source which supplies the substation: if the grid is down, G1 and/or G2 can operate in order to feed the loads, and vice versa. Thus, it is considered that failures occurring outside the substation do not affect the operation and lifespan of the components considered for the reliability analysis.

SAS Equipment Considered
Before starting the reliability analysis, it is necessary to survey the SAS equipment to be considered. Taking into account all the equipment necessary for the protection and control operations of the substation, the intention was to include as many components as possible without adding excessive complexity to the reliability analysis.
Starting at the process level, the implementation of Merging Units (MUs), which are responsible for digitizing and sending the current and voltage values to the bay-level devices, is required. In previous versions of SAS architectures, IEDs would send trip commands via the process bus to breaker IEDs, which would then trip the circuit breakers. However, recent models of MUs already include an actuator control order function, meaning that there is no need for breaker IEDs in the newest versions of SAS architectures.
At the bay level, only protection and control IEDs are considered due to their many capabilities and, consequently, their importance in digital substations.
At the station level, a Server (SV) and a Human-Machine Interface (HMI) are mandatory. The SV contains multi-function software that performs data concentration, protocol translation, automation logic, event file collection, and SCADA connectivity. It also enables the HMI, which corresponds to a user interface or dashboard that allows constant monitoring (e.g., showing alarms, switch positions, and historical data) and control (e.g., circuit breaker opening and closing) for operators in the substation.
Moreover, Time Synchronization Units (TS) are considered in order to ensure that protection and control devices will have accurately timestamped information about the voltages and currents in different parts of the substation, which is crucial when the protection and control device algorithms need to determine whether or not the measured magnitude and/or phase values are outside the operational limits [11].
Regarding the process and station buses, in order to guarantee communication between levels (forming the substation LAN), ethernet switches are required to receive and forward data from the source device to the destination device. In this work, two different types of ethernet switches are assumed: one type for Ethernet Switches (SW) in the process bus and another for switches in the station bus, called Station Switches (Ssw). As for communication links, two types are considered, each one for different connections. Ethernet Cable Links (EL) are assumed for the connections to the server and the HMI, whereas, for the remaining devices, Fiber Optic Links (FO) are considered. In Table 4.1, the failure rate and MTTR considered for each component are presented.

Communication Architectures
In a star architecture, as shown in Figure 4.4, each ethernet switch has a direct connection to all station switches, which in turn are all connected together. Although this configuration requires more connections than the ring architecture, it does not represent a significant increase in implementation costs, since the number of components, apart from communication links, is the same.
Thus, this architecture provides not only a higher communication redundancy but also the lowest latency in all four architectures.
Finally, the last architecture assumed in this work corresponds to the redundant ring architecture, which provides two completely redundant rings of ethernet switches in the process bus. For each architecture, due to the increasing redundancy, an improvement in reliability is expected, as well as in the availability of the substation. The results of the economic evaluation, however, are not so predictable: although the increasing redundancy of the architectures reduces penalty costs, it requires a greater investment in preventive maintenance.

Reliability Analysis Procedure
Following the identification of SAS components to be considered for the reliability analysis, it is important to understand how a failure of one of these components affects the operation of the substation.
Because SAS components are part of the secondary equipment, there is no direct relation between the failure of a component and substation inoperability. Nevertheless, since this equipment is responsible for the protection, automation, and control functions within substations, a failure in one of these components makes it difficult for the primary equipment to operate correctly, which in turn may directly affect the operation of the substation.
In this work it is assumed that failures in SAS equipment will only affect circuit breaker operation, as the circuit breaker is the only primary equipment whose operation depends on the SAS components. It is considered that circuit breakers can be affected by two failure modes caused by SAS failures:
• Mis-operation - unintended circuit breaker operation;
• Fail to operate - the circuit breaker did not operate when intended.
In order to establish a relation between failures of SAS equipment and substation inoperability, it is necessary to identify all possible failure modes that affect SAS components. In [27], the causes of each of the SAS failure modes were identified; each of these causes is a possible trigger for one of the previously mentioned circuit breaker faults.
The probabilities of occurrence for each failure mode were taken from [27] and are included in detail in Attachment A. Note that, regarding the second circuit breaker failure mode (fail to operate), the average number of times a circuit breaker operates in electric substations must be considered, since a breaker can only fail to operate when an operation is actually demanded and does not occur. Thus, it is necessary to assume an average unscheduled operation rate for circuit breakers so that, when such an event coincides with a fail-to-operate cause, it is possible to know whether this failure mode actually occurred. As given in [43], this rate is 1.135 operations per year, and it is related to the number of events leading to unscheduled circuit breaker operation.
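The role of the unscheduled operation rate can be sketched as follows: demand instants are sampled as a Poisson process at the 1.135 operations/year rate from [43], and a fail-to-operate event is only counted when a demand coincides with a window in which such a cause is active. The fault window below is hypothetical, purely for illustration:

```python
import math
import random

random.seed(7)
DEMAND_RATE = 1.135   # unscheduled circuit breaker operations per year [43]
YEARS = 40.0          # roughly a substation's normal lifetime

def poisson_process(rate, horizon, rng=random.random):
    # Sample event instants: exponential inter-arrival times via inverse transform.
    t, events = 0.0, []
    while True:
        t += -math.log(1.0 - rng()) / rate
        if t > horizon:
            return events
        events.append(t)

demands = poisson_process(DEMAND_RATE, YEARS)

# Hypothetical half-year window during which a fail-to-operate cause is active;
# only demands inside it become actual fail-to-operate events.
fault_window = (12.0, 12.5)
missed = [t for t in demands if fault_window[0] <= t <= fault_window[1]]
print(f"{len(demands)} demands in {YEARS:.0f} years; "
      f"{len(missed)} coincide with the active fail-to-operate cause")
```

This is why the fail-to-operate mode contributes much less unavailability than its cause probabilities alone would suggest: the cause must overlap an actual operation demand.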
It is important to highlight that the position of each ethernet switch in the communication network is taken into account. This is because the conditions to be met for the system to operate as intended differ depending on the components required for that purpose. As an example, in the cascaded architecture (Figure 4.2), equipment connected exclusively to SW3 (IED5, IED6, and MU3) is more vulnerable than the equipment connected to SW1, since the number of communication links that need to be operational between SW3 and the station-level components is greater.
Moreover, the implementation of at least one ethernet switch per busbar and inter-busbar section (where CB15 and CB16 are located) is considered for all communication architectures under study. With respect to station switches, it is assumed that these are connected to no more than three ethernet switches. For the sake of simplicity, it is also considered that a circuit breaker fails exclusively if both its corresponding protection and control IEDs fail, triggering the same circuit breaker failure mode. Regarding time synchronization, it is assumed that a single unit connected to each station switch is sufficient to keep all merging units synchronized.
RBDs were developed for all architectures, considering the failure modes of circuit breakers and the position of the ethernet switch in which their protection and control devices are in the network. These RBDs can be analysed in Attachment B.

Markov Chain of the Substation
Once the relationship between failures in SAS equipment and the operation of the substation is established, the following step is to define the Markov chain of the substation in order to compute its reliability.
For this purpose, four operational states, corresponding to the number of loads supplied by the substation, are considered. The Markov chain of the substation can be observed in Figure 4.6, and its states are described as follows:
• S1 - all loads supplied (operational state);
• S2 - any three loads supplied (failed state);
• S3 - any two loads supplied (failed state);
• S4 - no loads supplied (failed state).
Then, in order to compute the transition rates between states, RBDs were developed representing the possible combinations of circuit breaker failure modes that, upon failure, cause the substation to interrupt the power supply. It is considered that there are only three different transition values with respect to load supply failure. These transitions correspond to the same load loss from the substation with respect to the state it is in (e.g., the transition between states S1 and S2 is considered the same as the one between S2 and S3, since in both cases the supply of one load is interrupted from the initial to the current state). Therefore, transitions involving the same number of lost loads share equal failure rates. For simplicity, an MTTR of 24 hours is considered for circuit breakers, independently of the substation's present state.
From the developed RBDs, the transition rates between states were obtained. The Markov matrix, $\mathbf{A}$, of the substation can then be defined, as shown in Equation (4.1).
This way, it is possible to compute the state probabilities according to Equation (3.27) and hence the reliability of the substation, $R(t)$. However, as state S1 is considered the only operational state of the substation, the reliability is equal to this state's probability, $P_1(t)$, as demonstrated in Equation (4.2):

$$R(t) = P_1(t) \quad (4.2)$$
Finally, the Monte Carlo simulation method is used to simulate events happening throughout the lifetime of the substation and hence compute its availability. In this work, a substation lifetime of 100 years is assumed, in order to have a broader spectrum of its operation and to study its behavior even beyond the normal lifetime of substations, which is between 40 and 50 years.
In substations, the mindset for microprocessor-based equipment is that it should have a lifetime of 10 years. However, it is unlikely that this replacement cycle is fulfilled. Moreover, substations are sometimes expanded and not very often completely replaced. For that reason, a more accurate replacement cycle for this equipment is 15 or even 20 years [44].
In this work, it is assumed that all SAS equipment is replaced every 20 years.

Preventive Maintenance Plans
Two major types of equipment maintenance can be highlighted: corrective and preventive maintenance. As the name implies, corrective maintenance is carried out to repair malfunctions of any equipment as soon as they occur: if the failure that interrupts power supply has not been anticipated by other types of maintenance, specialized technicians deal with the problem as soon as it arises. Preventive maintenance, in contrast, is performed before any failure or malfunction occurs. It is often limited to a visual inspection, contact cleaning, and a check of the system's memory buffers in order to reduce the risk of failure and hence increase the reliability of the system [45]. Furthermore, as technology evolves, there is a tendency to increase preventive maintenance periods. In order to define preventive maintenance plans, standard values for the maintenance periods of SAS equipment are required. According to [45], maintenance periods for the components under study lie between one and 12 years, with an average of 3.5 years. Thus, for the first plan (Plan 1), a single maintenance period of 5 years is proposed for all components. For Plan 2 and Plan 3, different importance criteria were used to assign the maintenance periods applied to each component.
In Plan 2, a threshold was defined: any component whose critical importance value after 20 years is greater than the median of all components at the same time instant is considered to be more important. Consequently, a component is considered less important if its importance value is lower than the defined threshold. Hence, Plan 2 was designed considering two different maintenance frequencies: the most important components according to Birnbaum's importance measure are maintained every 4 years and the remaining components every 6 years.
Regarding Plan 3, the criterion used was simpler. After a lifetime of 20 years, the three components with the highest and the three with the lowest importance values are considered critical and least critical, respectively. The remaining components are considered to have intermediate criticality. Therefore, Plan 3 was designed considering three different maintenance frequencies: the most critical components are maintained every 3 years, the least critical every 7 years, and components with intermediate criticality every 5 years.
The maintenance periods of each component under the three maintenance plans are presented in Table 4.2, for each of the communication architectures.

In this study, a discount rate of 7% is assumed. Furthermore, investment, maintenance, and unavailability costs are considered for the economic evaluation.
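With a 7% discount rate, the total discounted cost of a yearly cost stream can be sketched as follows; the yearly amounts and the 30-year horizon are placeholders, not the paper's cost data:

```python
def discounted_total(cash_flows, rate=0.07):
    # Present value of yearly costs: sum of c_t / (1 + r)^t for t = 1..n.
    return sum(c / (1.0 + rate) ** t for t, c in enumerate(cash_flows, start=1))

# Hypothetical yearly cost streams (monetary units).
maintenance = [10.0] * 30   # preventive-maintenance cost each year
penalties = [5.0] * 30      # expected energy-not-supplied penalties each year
investment = 200.0          # paid at year 0, hence not discounted

total = investment + discounted_total([m + p for m, p in zip(maintenance, penalties)])
print(f"total discounted cost = {total:.2f}")
```

Because each future payment is divided by $(1+r)^t$, late-life costs contribute little to the total, which is why the mean total discounted cost curves reported later flatten out over the substation's lifetime.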

Reliability, Availability and Total Costs
For better clarity in the presentation and understanding of the results, this subchapter is distributed between substation's reliability, availability, and costs. Then, for each one of these factors, the results obtained for the different architectures (cascaded, ring, star, and redundant ring) are presented.
Starting with the reliability analysis of the substation, the most obvious observation is its high value.
This is an expected result, since this work focuses on the study of the impact of failures of electronic equipment in substations, which is characterized by considerably low failure rates. In addition, there is an improvement in reliability with the implementation of the defined maintenance plans; reliability becomes increasingly higher as the maintenance plan becomes more detailed. In the same sense, the reliability curve tends toward a straight line, as a result of the increasing number of maintenance periods in each plan. These phenomena can be observed in all the architectures considered, the only difference being the reliability value over time, which increases with the redundancy offered by each architecture. Regarding costs, for all maintenance plans and communication architectures considered, the curves of the mean value and variance of the total discounted costs seem to converge to a certain value. This happens due to the monetary value of time: the present value of the amounts to be paid decreases throughout the lifetime of the substation.
Once again, an improvement in the total discounted costs is visible, as these decrease with each successive maintenance plan. Furthermore, there is an increase in variance over time, which is expected, since the uncertainty in the operational behavior of the substation increases in the same way. However, the most interesting result of this analysis is that the total discounted costs do not decrease as the redundancy of the communication architectures increases: it is the star architecture that shows the best results in economic terms.
The results shown previously are summarized in Table 5.1. The variance values shown for both availability and total discounted cost correspond to the maximum obtained from the results. Nevertheless, the implementation of the star architecture along with the execution of preventive maintenance plan 3 is suggested, which corresponds to the option with the best trade-off between availability and total discounted costs.

Equipment Degradation Approach
In the case of a real system in operation, the failure rate of a given piece of equipment cannot be expected to remain exactly the same throughout its lifetime, even if it is maintained regularly. Moreover, the field of reliability physics deals with the subject of planned obsolescence of equipment. This term describes a strategy, adopted by many companies nowadays, which consists of deliberately ensuring that the current version of a given component will become outdated or even inoperable within a certain amount of time. This guarantees the need for consumers to seek replacements in the future, thus increasing demand. Therefore, an alternative analysis considering equipment degradation is performed.
The purpose of this analysis is to challenge the assumption of a constant failure rate during the useful life period of the bathtub curve. In this case, the useful life period is extended, as shown in Figure 5.10. Results show that the impact of the proposed approach is significant both in operational and economic terms. The decrease in availability relative to the scenario without equipment degradation is evident, and it translates into a considerable increase in the total discounted costs of the substation.
However, in this case the architecture that presents the best trade-off between availability and total discounted costs is the redundant ring architecture, with the application of preventive maintenance plan 3. Since availability decreases, the penalties for energy not supplied increase significantly, which is determinant for the difference between the two models.

Conclusions
The main contribution of this work is the proposal of a method that allows the identification of opportunities for companies in the electricity sector to become involved in the design and construction of digital substations, working towards a safer, more reliable, and economically efficient operation. Hence, this work can support the decision of which communication architecture should be implemented in a given substation. This decision should take into account how critical the substation is for the power system and the economic resources available at the time.
The use of the Markov-Monte Carlo simulation method proves to be a good approach for the reliability and availability analysis of any system. The combined advantages of both processes ensure:
• Simpler interpretation - Markov processes provide a clearer perception of the states of a given system and of its transitions;
• Flexibility - it is possible to integrate other methods, such as RBDs or fault trees, in the reliability analysis;
• Accuracy - Monte Carlo simulation is based on the development of an equivalent stochastic process that behaves as closely as possible to the real system under analysis.
However, this method can suffer an exponential increase in complexity with the number of components incorporated in the system and with the level of detail one wants to apply.
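The core of the Markov-Monte Carlo combination can be illustrated for a single repairable component: a two-state Markov chain (up/down) whose exponential holding times are sampled by Monte Carlo, with availability estimated as the fraction of time spent in the up state. The rates and horizons below are hypothetical, not the ones used in the study:

```python
import random

# Minimal Markov-Monte Carlo sketch for one repairable component.
# Failure and repair rates are illustrative assumptions.
random.seed(1)

FAILURE_RATE = 0.01   # lambda: up -> down transitions (per hour)
REPAIR_RATE = 0.5     # mu: down -> up transitions (per hour)
HORIZON = 100_000.0   # simulated hours per history
HISTORIES = 200       # number of Monte Carlo histories

def simulate_history():
    """Return the fraction of time spent in the 'up' state."""
    t, up_time, state_up = 0.0, 0.0, True
    while t < HORIZON:
        rate = FAILURE_RATE if state_up else REPAIR_RATE
        dwell = random.expovariate(rate)   # exponential holding time
        dwell = min(dwell, HORIZON - t)    # truncate at the horizon
        if state_up:
            up_time += dwell
        t += dwell
        state_up = not state_up            # Markov state transition
    return up_time / HORIZON

availability = sum(simulate_history() for _ in range(HISTORIES)) / HISTORIES
# Analytical steady-state availability: mu / (lambda + mu) ≈ 0.9804
print(round(availability, 4))
```

The complexity caveat above shows up here directly: with n components the state space grows as 2^n, which is why simulating sampled histories scales better than solving the full chain analytically.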

Future work
The work presented in this dissertation was developed with the intention of serving as a basis for future work in the field of electric power system reliability. Although every effort was made to make this work as detailed as possible, there are some aspects that can be discussed and improved in the future.
First, a more detailed analysis of the calculation of the state transitions (both failure and repair rates) of the developed Markov chain is suggested, since these transitions are not expected to be equal, as considered in this work. Additionally, in order to create a model that adapts to the evolution of each component's state and its importance to the operation of the circuit breakers, it may be beneficial to plan more dynamic preventive maintenance, for instance by increasing the frequency of maintenance over time. Moreover, considering that only a cost analysis was performed in this work, it may be useful to carry out a more complete economic evaluation by computing the Net Present Value (NPV) and the Internal Rate of Return (IRR). For this, it will be necessary to consider a source of financial return generated by the investment, which may be related, for instance, to the increase in availability or to the speed of service restoration. An analysis of how many times each component, upon failure, is responsible for the interruption of power supply from the substation would also be interesting as a way to find sources of vulnerability. Finally, the proposed approach considering equipment degradation can be studied further by integrating the field of reliability physics. The purpose is to increase the detail of the reliability analysis, which will encourage the development of new models and approaches, allowing more realistic results to be obtained.
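The NPV and IRR computations suggested above can be sketched with standard formulas; the cash-flow figures below are hypothetical placeholders, since a real study would derive them from, e.g., avoided energy-not-supplied penalties:

```python
# Sketch of the NPV / IRR evaluation proposed as future work.
# Cash flows are hypothetical illustration values.
def npv(rate, cash_flows):
    """Net Present Value; cash_flows[0] is the initial outlay (t = 0)."""
    return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))

def irr(cash_flows, lo=0.0, hi=1.0, tol=1e-9):
    """Internal Rate of Return by bisection.

    Assumes NPV is positive at `lo`, negative at `hi`, and decreasing
    in between (true for an initial outlay followed by inflows).
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if npv(mid, cash_flows) > 0.0:
            lo = mid   # NPV still positive: the IRR lies above mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical case: 1000 invested, returning 300 per year for 5 years.
flows = [-1000.0] + [300.0] * 5
print(round(npv(0.05, flows), 2))   # ≈ 298.84 at a 5% discount rate
print(irr(flows))                   # ≈ 0.152, i.e. roughly 15.2%
```

A project is worth pursuing under this criterion when its NPV at the chosen discount rate is positive, or equivalently when the IRR exceeds that rate.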

RBDs of the Architectures
Cascaded Architecture
• Misoperation Failure Mode: