Fast Reliability Assessment of Neutral-Point-Clamped Topologies through Markov Models

Abstract: This paper presents detailed Markov models for the reliability assessment of multilevel neutral-point-clamped (NPC) converter leg topologies, incorporating their inherent fault tolerance under open-circuit switch faults. The Markov models are generated and discussed in detail for the three-level and four-level active NPC (ANPC) cases, while the presented methodology can be applied to easily generate the models for a higher number of levels and for other topology variants. In addition, this paper proposes an extremely fast calculation method to obtain the precise value of the system mean time to failure from any given system Markov model. This method is then applied to quantitatively compare the reliability of two-level, three-level, and four-level ANPC legs under switch faults guaranteed to be open-circuit and under varying degrees of device paralleling. The comparison reveals that multilevel ANPC leg topologies inherently present a potential for higher reliability than the conventional two-level leg, questioning the suitability of the traditional search for topologies with the minimum number of devices in order to improve reliability. Experimental results are presented to validate the fault-tolerance assumptions upon which the presented reliability models for the three-level and four-level ANPC legs are based.


I. INTRODUCTION
RELIABILITY of power electronics systems has become of primary importance to fully leverage the advantages that this technology offers [1]-[3]. In many applications, the power electronics subsystem is one of the weakest links from the reliability point of view, and an unexpected sudden full system shutdown is not acceptable.
Reliability research has traditionally focused on two main areas: modeling and methods to improve reliability.
On the one hand, a significant effort has been devoted to the development of reliability models. At the component level, two types of models can be highlighted: empirical models, such as the Arrhenius-Coffin-Manson model [4]-[5] or the Palmgren-Miner linear cumulative damage model [5], and physics-of-failure models. At the system level, part-count models, combinatorial models (fault trees, success trees, and reliability block diagrams), and state-space models (Markov models) have been proposed. Artificial neural network models have also been employed to ease the introduction of reliability metrics into the design of power electronics systems [6].
On the other hand, several methods have been proposed to enhance the reliability of systems, which can be broadly categorized as: 1) using more suitable materials, shapes, and processes in the component and system implementation; 2) methods based on the system operation management, such as active thermal management and preventive maintenance supported by condition monitoring [7] and fault prognosis; and 3) methods based on increasing the redundancy of systems, at both the component and system level, tied to fault diagnosis and resulting in fault-tolerant systems.
One approach to increase the redundancy of a power converter is to employ multilevel topologies, since many of them present inherent redundancy. The reliability of several multilevel topologies has been studied in [8]-[16], including their reliability modeling and the strategies to operate them under faults. In particular, in [8], the reliability of some multilevel converters is modeled, assuming power device short-circuit faults, through reliability block diagrams, from which the system reliability can be obtained as a somewhat complex function of time. It is concluded that multilevel converters can present a higher reliability than a conventional two-level converter over an initial period of time. In [9], Markov models are used to analyze and compare multilevel inverters. Nevertheless, these models do not consider the inherent topology fault tolerance; e.g., that neutral-point-clamped (NPC) converters can continue operating under multiple open-circuit switch faults [10]. Therefore, they are oversimplified models considering only two system states. In summary, the literature lacks detailed Markov models accounting for the inherent fault tolerance of multilevel converters, despite Markov models being the most powerful models at the system level. Generating such Markov models is recognized to be a challenge [1].
To help fill this gap, this paper derives Markov models to characterize the reliability of multilevel NPC topologies under open-circuit faults and, from these models, proposes a fast method to compute the mean time to failure (MTTF) of these topologies, which enables the use of the MTTF as a figure of merit to quickly characterize the reliability of multiple topology options in optimization processes. The paper is organized as follows. Section II reviews the basics of reliability and Markov models. Section III proposes a fast method to compute the MTTF from a system Markov model. Section IV presents the Markov models of two-level, three-level, and four-level active NPC (ANPC) legs with a variable number of parallel switches per position, and performs an MTTF comparison of these topologies under several degrees of paralleling and simple conditions to explore their inherent reliability features. Section V presents experimental results to illustrate the behavior of multilevel ANPC legs under several concurrent open-circuit switch faults. Finally, Section VI outlines the conclusions.
Authors: Sergio Busquets-Monge, Roya Rafiezadeh, Salvador Alepuz, Alber Filba-Martinez, and Joan Nicolas-Apruzzese.

II. BASICS OF RELIABILITY
The reliability of a system at time t, R(t), is defined as the probability that the system is still operating at time t. If a relatively large set of identical systems is tested in parallel over time, it can be calculated as
$$R(t) = \frac{N_S(t)}{N_S(0)} \qquad (1)$$
where $N_S(t)$ is the number of systems still operating at time t (i.e., $N_S(0) - N_S(t)$ systems have already failed at time t).
A convenient figure of merit to assess and compare the reliability of different systems is the mean time to failure, defined as
$$\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt \qquad (2)$$
Another important parameter is the failure rate, defined as
$$\lambda(t) = -\frac{1}{R(t)}\,\frac{dR(t)}{dt} \qquad (3)$$
The failure rate at time t indicates the probability of a system failure in the next unit of time, given that the system is still operating at time t. It is usually expressed in FIT, where 1 FIT = $10^{-9}$/h. The failure rate usually depends on the operating conditions.
If λ(t) is constant over time and equal to λ, then
$$R(t) = e^{-\lambda t} \qquad (4)$$
and
$$\mathrm{MTTF} = \frac{1}{\lambda} \qquad (5)$$
A very convenient tool to study the reliability of a complex system is its Markov model, represented by a diagram known as a Markov chain. In a Markov chain, the different relevant system states are represented as nodes, and the possible transitions between states are represented by arrows with an associated transition rate. For example, Fig. 1 presents the simplest Markov chain, with only two states: state 1, in green, representing the system in its original state, with no internal failed devices, and state 0, in red, representing the system under failure; i.e., the state of the system once it has stopped operating because of the failure of one or more internal devices. The simple diagram of Fig. 1 is appropriate when any device failure within the system leads to a full system failure. However, when certain device failures within a system lead to states where the system can still operate, although under limited conditions, additional states have to be included in the Markov chain. This is the case of the example system represented in Fig. 2.
The probability of being in each state evolves over time according to
$$\frac{dp_1(t)}{dt} = -\lambda_{1,0}\,p_1(t), \qquad \frac{dp_0(t)}{dt} = \lambda_{1,0}\,p_1(t) \qquad (6)$$
in Fig. 1, and according to the analogous four-state system (7) in Fig. 2. In a general case, this equation can be formulated as
$$\frac{d\mathbf{p}(t)}{dt} = \mathbf{\Lambda}\,\mathbf{p}(t) \qquad (8)$$
where $\mathbf{p}(t)$ is the vector of state probabilities and $\mathbf{\Lambda}$ is the matrix of transition rates. Assuming that all transition rates among states are constant, the vector of probabilities can be obtained as
$$\mathbf{p}(t) = e^{\mathbf{\Lambda} t}\,\mathbf{p}(0) \qquad (9)$$
The system reliability can then be computed as
$$R(t) = 1 - p_0(t) \qquad (10)$$
and (2) can then be used to calculate the system MTTF. The system failure rate can be calculated as
$$\lambda_{sys}(t) = -\frac{1}{R(t)}\,\frac{dR(t)}{dt} \qquad (11)$$
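As a worked illustration of the relations in this section, the two-state chain of Fig. 1 can be solved in closed form. The sketch below assumes a single constant transition rate, written here as $\lambda_{1,0}$, and the initial condition $p_1(0) = 1$. The value of 100 FIT in the closing comment is an arbitrary illustrative number, not a value from this study.

```latex
\begin{align*}
\frac{dp_1(t)}{dt} &= -\lambda_{1,0}\,p_1(t), \quad p_1(0) = 1
  \quad\Rightarrow\quad p_1(t) = e^{-\lambda_{1,0} t},\\
R(t) &= 1 - p_0(t) = p_1(t) = e^{-\lambda_{1,0} t},\\
\mathrm{MTTF} &= \int_0^{\infty} e^{-\lambda_{1,0} t}\,dt = \frac{1}{\lambda_{1,0}},\\
\lambda_{sys}(t) &= -\frac{1}{R(t)}\frac{dR(t)}{dt} = \lambda_{1,0}.
\end{align*}
% Illustrative numbers (assumed): 100 FIT = 10^{-7}/h gives MTTF = 10^{7} h,
% i.e., about 1140 years.
```

For this simplest chain the system failure rate coincides with the single transition rate, which is the constant-rate case of (4) and (5).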

III. PROPOSED FAST METHOD TO COMPUTE MTTF
The calculation of the system reliability presented in the previous section, based on the system Markov model, is aimed at obtaining the value of the system reliability at each point in time, and involves a certain computational complexity. However, to assess the reliability of a system, an average-type figure of merit such as the MTTF is often enough. It would then be very useful to find a way to compute the MTTF without the need to compute R(t). This can be done as follows.
The procedure will be illustrated with the example system of Fig. 2. Let us assume that any time the red system failure state is reached, the system is immediately fully repaired and returned to the green initial state; i.e., an infinite repair rate is considered. The new Markov chain diagram is illustrated in Fig. 3. In this situation, the probability of being in each state will reach a constant steady-state value ($P_0$, $P_1$, $P_2$, and $P_3$), with $P_0 = 0$ and $P_1 + P_2 + P_3 = 1$, because the system is immediately repaired when it fails and the system must always be either in state 1, state 2, or state 3. In addition, in this steady state, the transitions into each yellow intermediate state must equal the transitions departing from the same intermediate state. All the above can be formulated as the linear system (12), from which the vector of probabilities can be easily isolated, as expressed in (13). The system failure rate can then be computed through (14), as the sum of the steady-state probability of each state multiplied by its transition rate into the failure state. Finally, since $\lambda_{sys}$ is constant over time, the MTTF can be easily calculated as
$$\mathrm{MTTF} = \frac{1}{\lambda_{sys}} \qquad (15)$$
The MTTF value obtained through this simple procedure is exactly the same as the one obtained through the cumbersome procedure presented in Section II. Thus, equations (12)-(15) greatly simplify the calculation of the system MTTF. Table I shows that the computation time of the MTTF in this simple example can be reduced more than 500,000 times using the proposed procedure (see supplementary MATLAB script).
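In a general case, the equations describing the proposed computation procedure can be formulated as
$$\mathbf{A}\,\mathbf{P} = [0 \;\cdots\; 0 \;\; 1]^T, \qquad \lambda_{sys} = \sum_{k=1}^{m} \lambda_{k,0}\,P_k, \qquad \mathrm{MTTF} = \frac{1}{\lambda_{sys}} \qquad (16)$$
where $\mathbf{P} = [P_1 \; P_2 \; \dots \; P_m]^T$, $\lambda_{k,0}$ is the transition rate from state k to the failure state, and the last row of matrix $\mathbf{A}$ contains all ones.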
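The procedure of this section can be sketched in a few lines of code. The following is a minimal illustration, not the supplementary MATLAB script: the function names, the dictionary encoding of transition rates, and the three-state example chain with its rates are all hypothetical choices made here (the rates of Fig. 2 are not reproduced). Exact rational arithmetic is used so that the result can be compared, without rounding, against a conventional expected-time-to-absorption calculation on the chain without repair.

```python
from fractions import Fraction


def solve(a, b):
    """Solve a*x = b exactly by Gauss-Jordan elimination on Fractions."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]


def exit_rates(rates, m):
    """Total departure rate of each operating state 1..m."""
    return {i: sum(Fraction(r) for (src, _), r in rates.items() if src == i)
            for i in range(1, m + 1)}


def fast_mttf(rates, m):
    """Steady state under instantaneous repair, then MTTF = 1 / lambda_sys.

    rates: {(i, j): rate} for transitions i -> j; state 0 is the failure
    state, state 1 the failure-free state, states 2..m are intermediate.
    """
    out = exit_rates(rates, m)
    a, b = [], []
    for k in range(2, m + 1):  # balance: inflow equals outflow of state k
        row = [Fraction(rates.get((i, k), 0)) for i in range(1, m + 1)]
        row[k - 1] -= out[k]
        a.append(row)
        b.append(0)
    a.append([1] * m)  # last row of ones: probabilities sum to 1
    b.append(1)
    p = solve(a, b)
    lam_sys = sum(p[i - 1] * rates.get((i, 0), 0) for i in range(1, m + 1))
    return 1 / lam_sys, p


def absorption_mttf(rates, m):
    """Reference value: expected time to reach state 0 from state 1."""
    out = exit_rates(rates, m)
    a = [[(out[i] if i == j else 0) - Fraction(rates.get((i, j), 0))
          for j in range(1, m + 1)] for i in range(1, m + 1)]
    return solve(a, [1] * m)[0]


# Hypothetical chain with three operating states (rates chosen arbitrarily).
example = {(1, 2): 2, (1, 3): 1, (2, 0): 3, (2, 3): 1, (3, 0): 2}
mttf, probabilities = fast_mttf(example, 3)
print(mttf, absorption_mttf(example, 3))  # both print 3/4
```

In this example the steady-state probabilities are 4/9, 2/9, and 1/3, the system failure rate is 4/3, and both routes give an MTTF of 3/4 in the normalized time unit.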

IV. APPLICATION TO MULTILEVEL NPC TOPOLOGIES
In this section, the previously conceived method will be applied to compute the MTTF of a conventional two-level leg and its extension into multilevel ANPC legs. The resulting MTTF values will be compared under different scenarios.
In the aforementioned study, it will be assumed that all power semiconductor devices always fail in open circuit, because this is the most favorable situation from the reliability point of view. In practice, although power semiconductor devices may fail in both short circuit and open circuit, by adding some auxiliary circuitry acting as an electronic fuse in series with the power semiconductor, it can be guaranteed that the compound device formed by this series association ends up failing in open circuit. This compound device, designated here as switching cell (SC), is conceptually illustrated in Fig. 4, where $S_m$ represents the main switch and $S_a$ represents an auxiliary switch connected in series to perform the electronic fuse function. $S_a$ should be a very reliable, low-conduction-loss switch that is normally ON; whenever a failure of $S_m$ is detected, $S_a$ turns permanently OFF. Thus, the overall SC can be regarded as a single switch that always fails in open circuit. The good performance of this configuration will be demonstrated and discussed in detail in a separate future publication.

A. Two-Level Leg
Fig. 5(a) shows the topology of the conventional two-level leg. It contains two switches, labelled with a two-digit code indicating the row and column where the switch is located. The two switching states allowing the connection of the leg ac terminal to the two dc-link terminals are represented in the first row of Fig. 6. In each switching state, ON switches are represented with a solid line and OFF switches are represented with no line. The line representing an ON switch is red to indicate that the switch connects the ac terminal with the intended dc terminal and carries the ac terminal current.
In the conventional two-level leg, the open-circuit failure of any switch leads to a full system failure, because once a switch fails, the leg can no longer switch between the two dc-link points; thus, the leg loses its essential functionality: being capable of connecting the ac terminal to more than one dc-link point. This is indicated in the Markov chain diagram of Fig. 7, where $\lambda_{11}$ and $\lambda_{21}$ represent the failure rate of switches 11 and 21, respectively.
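For this leg, the MTTF calculation is immediate, and can be written out as a short worked case. The lines below are a sketch under the stated open-circuit assumption, using $P_1$ for the steady-state probability of the failure-free state under instantaneous repair. The normalized value $\lambda_{11} = \lambda_{21} = 1$ in the closing comment is an illustrative choice.

```latex
\begin{align*}
P_1 &= 1 \quad \text{(the failure-free state is the only operating state)},\\
\lambda_{sys} &= (\lambda_{11} + \lambda_{21})\,P_1 = \lambda_{11} + \lambda_{21},\\
\mathrm{MTTF} &= \frac{1}{\lambda_{sys}} = \frac{1}{\lambda_{11} + \lambda_{21}}.
\end{align*}
% With normalized rates lambda_11 = lambda_21 = 1, this gives MTTF = 1/2.
```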
B. Three-Level ANPC Leg
Fig. 5(b) shows the topology of the three-level ANPC leg. It contains six switches, labelled with a two-digit code indicating the row and column where the switch is located. The three switching states considered for the connection of the leg ac terminal to the three dc-link terminals are represented in the second row of Fig. 6 [17]. Again, in each switching state, ON switches are represented with a solid line and OFF switches are represented with no line.
Through a systematic search and analysis of all possible combinations of failed devices, it is possible to establish the Markov chain diagram for the three-level ANPC leg. This systematic search and analysis is performed in one of the supplementary MATLAB scripts provided with this manuscript. Fig. 8 presents the resulting Markov chain diagram, with 17 relevant states. Each state contains a diagram indicating the state of each switch of Fig. 5(b): a green dot indicates that the switch is operating correctly and a red cross indicates that the switch has failed in open circuit. The number of available levels is also indicated for each state. It can be observed that the failure of only one switch does not lead in any case to the full system failure state, because in all these cases the leg ac terminal can still be connected to more than one dc-link point. It is interesting to note that in states 3 and 4, the three levels are still available. The concurrent failure of two switches may also not lead to a full system failure. The system reaches the system failure state (state 0) only when two or three concurrent switch failures leave one or zero levels available.

C. Four-Level ANPC Leg
Fig. 5(c) shows the topology of the four-level ANPC leg. It contains twelve switches, labelled with a two-digit code indicating the row and column where the switch is located. The four switching states considered for the connection of the leg ac terminal to the four dc-link terminals are represented in the third row of Fig. 6 [17]. Again, in each switching state, ON switches are represented with a solid line and OFF switches are represented with no line. The line representing an ON switch is red when the switch connects the ac terminal with the intended dc terminal and carries a portion of the ac terminal current, and it is green when the ON switch simply clamps the blocking voltage of OFF-state switches to the elementary value $v_{dc}/3$ and carries no current.
Through a systematic search and analysis of all possible combinations of failed devices, it is possible to establish the Markov chain diagram for the four-level ANPC leg. This systematic search and analysis is performed in one of the supplementary MATLAB scripts provided with this manuscript. The Markov chain contains 1,118 relevant states. Note that the complexity of the Markov chain rises exponentially with the number of switches in the leg. From the analysis of these states, it can be concluded that the leg can continue operating with up to eight concurrent open-circuit switch failures; i.e., in four different states featuring eight concurrent failed switches, the leg is still able to connect the ac terminal to two different dc-link points. On the other hand, the full leg failure state can be reached if both switches 33 and 43 fail; i.e., with only two concurrent switch failures. This means that these two positions should be occupied by switches featuring a low failure rate.

D. Devices in Parallel
One way to improve the reliability of systems involving switches that fail in open circuit is to introduce additional switches connected in parallel with the preexisting ones. By introducing additional switches in parallel with a given switch, the current is distributed among the paralleled devices, reducing the current stress and therefore reducing the failure rate of each individual switch. In addition, when one of the paralleled switches fails, the remaining parallel switches can continue operating, possibly with higher current stress and a higher failure rate. Ultimately, the overall failure rate of the set of paralleled devices ends up being lower than the failure rate of a single device. The reduction is especially noticeable for a moderate number of paralleled devices [11]. Let us analyze the reliability of an isolated system formed by pr parallel devices. Fig. 9 illustrates the Markov chain of this system, with pr varying from 1 to 5. State k corresponds to the set with k - 1 failed devices, except for state 0, which corresponds to the state with all devices failed. Variable $\lambda_q$ indicates the failure rate of one device when q devices are operating in parallel. From Fig. 9 and applying the method presented in Section III, the expressions in (17) of the overall system failure rate $\lambda_{sys,pr}$ are quickly obtained.
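The step from the chain of paralleled devices to its overall failure rate can be written out explicitly. The derivation below is a sketch under one assumption about the chain that is not spelled out in the text: the chain is sequential, and the transition out of the state with q operating devices occurs at rate $q\lambda_q$ (any one of the q devices may fail). The resulting closed form is derived here and is not copied from (17).

```latex
% Steady state under instantaneous repair: the same probability flow crosses
% every state of the sequential chain, and it equals lambda_sys.
\begin{align*}
q\,\lambda_q\,P^{(q)} &= \lambda_{sys,pr}, \qquad q = 1, \dots, pr
  \quad (P^{(q)}:\ \text{probability of the state with } q \text{ operating devices}),\\
\sum_{q=1}^{pr} P^{(q)} &= 1
  \;\;\Rightarrow\;\;
  \lambda_{sys,pr} = \left( \sum_{q=1}^{pr} \frac{1}{q\,\lambda_q} \right)^{-1},\\
pr = 2:\quad \lambda_{sys,2} &= \frac{2\,\lambda_1 \lambda_2}{\lambda_1 + 2\,\lambda_2},
  \qquad
  \lambda_q = \lambda \ \ \forall q:\quad \lambda_{sys,2} = \tfrac{2}{3}\,\lambda .
\end{align*}
```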
The paralleling of switches can be used to improve the reliability of ANPC legs. Their MTTF will especially improve if paralleling is applied to the most critical switch positions. If paralleling is used in a given position of the leg, the new system reliability can be estimated, in a first approximation, using the same Markov chain diagrams derived in Sections IV.A, IV.B, and IV.C, and setting the failure rate of the corresponding switch position equal to the value computed with (17).
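This substitution can be sketched for the simplest case, the two-level leg. The code below is illustrative only: it assumes a sequential chain of paralleled devices in which the state with q operating devices is left at rate q times the per-device rate, which gives an overall rate equal to the inverse of the sum of 1/(q·λ_q), and it assumes the same degree of paralleling in both switch positions. The function names and the default of a unit per-device rate are choices made here, not taken from the supplementary scripts.

```python
from fractions import Fraction


def parallel_failure_rate(pr, device_rate=lambda q: Fraction(1)):
    """Overall failure rate of pr paralleled devices failing in open circuit.

    device_rate(q) is the failure rate of one device while q devices share
    the current (assumed constant and equal to 1 by default).
    """
    return 1 / sum(1 / (q * Fraction(device_rate(q))) for q in range(1, pr + 1))


def two_level_mttf(pr):
    """MTTF of a two-level leg with pr devices in each of its two positions.

    Any position failing stops the leg, so the two position rates add up.
    """
    return 1 / (2 * parallel_failure_rate(pr))


for pr in range(1, 6):
    print(pr, parallel_failure_rate(pr), two_level_mttf(pr))
```

Under these assumptions, the position failure rate for pr = 1 to 5 is 1, 2/3, 6/11, 12/25, and 60/137, and the corresponding two-level MTTF is 1/2, 3/4, 11/12, 25/24, and 137/120 in the normalized time unit.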

E. MTTF Comparison Study and Discussion
The models presented in the previous sections and the proposed fast method to compute the MTTF can be very useful to characterize and compare the reliability of different systems. Thanks to the low MTTF computation time, it can be incorporated in optimization procedures requiring the evaluation of many alternative designs. In this section, a comparative study of multilevel ANPC topologies is presented to illustrate the potential of this modeling approach. The study provides a deeper insight into the reliability of these systems and reveals results that refute commonly accepted beliefs.
It is commonly accepted that a topology with a larger number of devices leads to worse reliability. Among other reasons, this belief has motivated the search for multilevel topologies with a reduced number of devices. This belief is based on the usual case of systems where any device failure leads to a global system failure. Nevertheless, if the topology has inherent redundancy and the failure of one device does not lead to a global system failure, this is not necessarily true.
Let us analyze and illustrate this with a comparison of the MTTF of ANPC topologies with two, three, and four levels and the same full dc bus voltage employing the reliability models and fast MTTF calculation method presented in previous sections. For the sake of simplicity, the failure rate of the SCs will be roughly estimated with a simple normalized value. In a more accurate modeling approach, a converter leg thermoelectrical model should be employed in each possible state to compute each SC temperature, and then, conventional reliability model expressions, based among others on the Arrhenius law, would determine the failure rate of the SC from the value of the SC temperature, the SC blocking voltage, and other predetermined parameters, as discussed in section II.C of [8]. However, this is not deemed convenient in this preliminary study, whose only aim is to explore basic comparative features and trends.

1) Comparison 1: comparison at an SC failure rate independent of voltage rating
The first comparison assumes a normalized value of the failure rate λ = 1 for all SCs over their whole life, regardless of their operating conditions and voltage rating; i.e., it is assumed that all SCs within a leg suffer the same stress in all operating conditions over the leg lifetime and that SCs with different voltage ratings have the same reliability. Obviously, this is not realistic, but it allows us to easily obtain a first approximation of the leg MTTF that reflects its inherent reliability, and thus to classify the different possible leg configurations from the reliability point of view.
In these conditions, the failure rate of a set of pr SCs in parallel is shown in the left column of Table II. Leg configurations with two, three, and four levels, and a degree of paralleling from 1 to 5 in each leg switch position have been analyzed. The MTTF value for each configuration is shown in Table III and plotted in Fig. 10, where n is the leg number of levels and #SC is the total number of SCs within the leg.
It can be observed that without paralleling (pr = 1), the MTTF increases with the number of levels, in spite of increasing the number of SCs with the number of levels. This also occurs for the same level of paralleling (pr = 2, 3, 4, and 5). This is due to the fact that as we increase the number of levels, the topology has an inherent higher degree of redundancy: redundancy in the dc bus points to which the ac terminal can be connected and redundancy in the paths that allow the connection of the ac terminal to these dc bus points.
It can also be observed that if the use of paralleling is considered and topologies with different numbers of levels but with the same number of SCs are compared, then, as the number of levels increases, the MTTF decreases. For instance, the MTTF of n = 2 with pr = 3 is higher than the MTTF of n = 3 with pr = 1. However, this is not a fair comparison, because the SCs used in a leg with a higher number of levels should be simpler and more reliable, since they have a lower voltage rating. Let us take this aspect into account in a second comparison, which should be fairer.

2) Comparison 2: comparison at an SC failure rate proportional to the voltage rating
This second comparison is performed under the same assumptions as the first one, except that now the SC failure rate is assumed to be proportional to the SC voltage rating, with a normalized value λ = 1 for the cells used in a two-level leg.
In these conditions, the failure rate of a set of pr SCs in parallel for an n-level leg is shown in the right column of Table II.
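The effect of this assumption can be made explicit with a short worked relation. The lines below are a sketch that assumes every SC of an n-level leg is rated for the elementary voltage $v_{dc}/(n-1)$, so that its normalized failure rate becomes $1/(n-1)$. The symbols $\mathrm{MTTF}_1$ and $\mathrm{MTTF}_2$, denoting the results of the first and second comparison, are introduced here for illustration.

```latex
\begin{align*}
\lambda_{SC}(n) &= \frac{v_{dc}/(n-1)}{v_{dc}}\cdot 1 = \frac{1}{n-1}
  \qquad (n = 2:\ 1,\quad n = 3:\ \tfrac{1}{2},\quad n = 4:\ \tfrac{1}{3}),\\
\mathrm{MTTF}_2(n, pr) &= (n-1)\,\mathrm{MTTF}_1(n, pr).
\end{align*}
% Second line: scaling every transition rate of a Markov chain by a common
% factor c scales the MTTF by 1/c; here c = 1/(n-1).
```

Under this assumption, the two-level values are unchanged, while the three-level and four-level values of the first comparison are multiplied by 2 and 3, respectively.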
The new MTTF values for the different analyzed leg configurations are presented in Table III and Fig. 11.
It can be now observed that the MTTF increases as the number of levels increases, even if the comparison is made for the same number of SCs.

V. EXPERIMENTAL RESULTS
In this section, experimental results are presented to illustrate the behavior of multilevel ANPC legs under certain open-circuit switch faults and, more specifically, to validate the assumption that multilevel ANPC legs can continue operating under the concurrent open-circuit fault of some switches.
A three-level ANPC leg (Fig. 5(b)) and a four-level ANPC leg (Fig. 5(c)) have been implemented with 100-V metal-oxide-semiconductor field-effect transistors and then tested with 50-V dc power supplies across adjacent dc-link terminals, a series 33-Ω, 3-mH load connected between the ac and $dc_1$ terminals, the same duty ratio of connection to all available dc-link points, and a switching frequency of 10 kHz. The switch control signals are generated with the aid of a dSpace control platform equipped with a DS5101 digital waveform output board.
Fig. 12 shows the voltage of the ac terminal with reference to node $dc_1$ ($v_{ac}$), the current through the ac terminal ($i_{ac}$), and the binary switch control signals (high level: switch ON, low level: switch OFF) in the three-level leg case under several fault states. In the first and second switching cycles, the leg is in a failure-free state and produces a $v_{ac}$ with three levels, as expected. At the beginning of the third switching cycle, switch 21 fails, but $v_{ac}$ still presents three levels, because the failure of switch 21 only eliminates one of the two redundant paths to connect to $dc_2$, according to Fig. 6. At the beginning of the fourth switching cycle, switch 41 fails, leading to a $v_{ac}$ with only two levels, since the failure of switch 41 eliminates the only path to connect to $dc_3$ (see Fig. 6(c)). Finally, at the beginning of the fifth switching cycle, switch 32 fails, eliminating the remaining path to connect to $dc_2$, and thus leading to a full leg failure state, since $v_{ac}$ can no longer present more than one level.
Fig. 13 shows $v_{ac}$, $i_{ac}$, and relevant switch control signals in the four-level leg case under several fault states. In the first and second switching cycles, the leg is in a failure-free state and produces a $v_{ac}$ with four levels, as expected. A sequence of switch failures (21, 41, 32, 61, 52, and 43) occurs during the next switching cycles, as indicated in Fig. 13. According to Fig. 6, the first three failures (switches 21, 41, and 32) only eliminate part of the redundant paths to connect to the dc points, and thus $v_{ac}$ preserves four levels. The failure of switch 61 eliminates the only path to connect to $dc_4$, and $v_{ac}$ presents three levels. The failure of switch 52 eliminates the remaining path to connect to $dc_3$, and $v_{ac}$ presents two levels. Finally, the failure of switch 43 eliminates the remaining path to connect to $dc_2$, leading to a full leg failure state, since $v_{ac}$ can only present one level.

VI. CONCLUSION
A very fast method to compute the MTTF of power electronics systems based on their Markov model has been presented. Assuming the use of SCs designed to always fail in open circuit, the Markov chain diagrams of two-level, three-level, and four-level ANPC converter legs have been derived, considering for the first time all possible intermediate states, and the proposed MTTF computation method has been employed to conveniently compare their MTTF with different numbers of parallel devices in each switch position. The comparison reveals that topologies with a larger number of devices may feature higher reliability if they present enough inherent redundancy, which contradicts the common belief that the higher the number of devices, the lower the reliability.
The proposed Markov models of multilevel ANPC legs and the proposed fast MTTF computation method enable the search through optimization algorithms of the most reliable leg configuration under specific operating conditions, where a large number of different configurations have to be evaluated.