Transfer Learning-Motivated Intelligent Fault Diagnosis Designs: A Survey, Insights, and Perspectives

Over the last decade, transfer learning has attracted a great deal of attention as a new learning paradigm, based on which fault diagnosis (FD) approaches have been intensively developed to improve the safety and reliability of modern automation systems. Because of inevitable factors such as the varying work environment, performance degradation of components, and heterogeneity among similar automation systems, the FD method having long-term applicabilities becomes attractive. Motivated by these facts, transfer learning has been an indispensable tool that endows the FD methods with self-learning and adaptive abilities. On the presentation of basic knowledge in this field, a comprehensive review of transfer learning-motivated FD methods, whose two subclasses are developed based on knowledge calibration and knowledge compromise, is carried out in this survey article. Finally, some open problems, potential research directions, and conclusions are highlighted. Different from the existing reviews of transfer learning, this survey focuses on how to utilize previous knowledge specifically for the FD tasks, based on which three principles and a new classification strategy of transfer learning-motivated FD techniques are also presented. We hope that this work will constitute a timely contribution to transfer learning-motivated techniques regarding the FD topic.


A. Brief Concepts of FD Techniques
Actually, FD is a general concept encompassing several categories, such as fault detection, fault isolation, fault identification, and fault reconstruction [32]. Among these goals, fault detection is the easiest in an FD task; its purpose is to trigger alarms when system performance is negatively affected by faults [38]. Fault isolation provides more information than detection because it locates the faults of interest [11]. Besides, fault identification categorizes the type of faults and finds their roots [39]. Fault reconstruction/estimation is the process of estimating fault amplitude using the idea of redundancy [4]. It is worth mentioning that fault location and identification are different from the so-called classification problem in the machine learning field [7]. An intuitive explanation is that, especially for dynamic systems, fault matrices signify fault directions that are independent of their amplitudes. This means that faults with different amplitudes should not affect FD results such as fault location and type. Roughly speaking, the fault matrix is an unknown parameter that should be identified and estimated. Therefore, identifying fault matrices, rather than performing simple classification, is necessary for FD tasks [1]. For an in-depth understanding of FD topics, a schematic description of conventional designs is shown in Fig. 1.
Due to the rapid development of modern control theory, model-based FD techniques have been one of the mainstream approaches since the 1980s [37]. The precondition of their implementation is that physical and mathematical knowledge of the system must be known a priori [44]. Because of numerous notable advantages, such as simple forms and easy implementation, data-driven FD methods have gained popularity over the past two decades. By using a large amount of historical data, the multivariate analysis technique is well suited to addressing FD issues for primarily static systems [45]; on the other hand, system identification is a powerful tool when considering dynamic systems [46], [47]. The development of FD methods dealing with a time sequence can be found in [32] and [48].
A retrospective review shows that the FD techniques developed in the past two decades have changed dramatically with the development of machine learning, computer science, big data, sensor technology, and so on [49]. Now, these FD techniques are evolving into more intelligent and broadly applied tools. They can be collectively called intelligent FD, a general name that covers all of these techniques.

C. Contributions and Structure
Nevertheless, considerable variations (including the time-varying environment, different operation modes, dissimilarities among homogeneous systems, and performance degradation) may invalidate previously established FD models [11]. As a result, reasonable adjustments of system modeling and feature extraction [50] are necessary for successful FD tasks, which has catalyzed many transfer learning-motivated FD strategies over the past decade [51], [52], [53], [54], [55], [56], [57], [58], [59], [60]. Based on the current development, the main feature of transfer learning-motivated FD methods may be summarized as follows: by sufficiently and effectively utilizing transferable knowledge from the source domain, less data from the target domain is needed to identify the other necessary knowledge and to further form the residual generation. The main advantage of transfer learning-motivated FD methods lies in simple identification procedures and lower computational loads, especially when the training samples in target domains are insufficient.
In order to provide a timely and comprehensive body of knowledge on the transfer learning-motivated FD methods, we intend to complete this review with their essentials, limitations, and perspectives from generalized insights. The main contributions are fourfold, as given in the following.
1) The first contribution is to revisit the basic knowledge of FD and transfer learning. Essentially, the residual generator can be regarded as a cornerstone of FD research. These fundamentals can provide researchers with instructive and valuable guidance when transfer learning is incorporated into FD designs.
2) The second emphasis is on reclassifying transfer learning-motivated FD methods according to the core concepts behind these approaches. This survey details a novel sketch that covers both knowledge calibration and knowledge compromise.
3) The third attempt is to make a purposeful discussion from generalized insights, drawing on up-to-date FD research. It is vital to realize that transfer learning is a paradigm rather than a new learning technique. These insights elaborately enumerate its embryonic forms that are widespread in practical FD applications.
4) The final contribution is, benefiting from the understanding gained from both theoretical and application aspects, to delineate challenges and possible research opportunities on transfer learning-motivated FD methods.
Even though signal processing-based techniques are also popular in detecting and diagnosing faults for mechanical components, the main focus of this review is on model-based and data-driven FD methods (that utilize machine learning, system identification, control theory, and so on) motivated by transfer learning. In fact, from the viewpoint of knowledge transfer and reuse, the main results presented in this survey are still valid for FD in the mechanical field.
The rest of this survey is organized as follows. The basic knowledge about FD and transfer learning is introduced in Section II, whose purpose is to expound on several positions where transfer learning is necessary. Section III details the transfer learning-motivated FD methods, where knowledge calibration and compromise are systematically reviewed according to three principles. Several insights about transfer learning-motivated FD methods, including bridges, challenges, and perspectives, are elaborately enumerated in Section IV. Section V presents a real-world application of transfer learning used in FD. Section VI ends this article with the conclusions.

II. BASIC KNOWLEDGE AND RECENT DEVELOPMENTS
A complex automation system can be described by using a linear or nonlinear model. After the presentation of system models, this section introduces transfer learning and the scenarios in which transfer learning can be applied.
A. Descriptions of Automation Systems
1) Linear System Models: Fig. 1 presents conventional FD design procedures that are suitable for both linear and nonlinear systems. Based on first principles [32] or system identification [46], a linear dynamic model can be described by Model I in (1), where the system matrices (A, B, C, and D) have appropriate dimensions; $x \in \mathbb{R}^{k_x}$, $u \in \mathbb{R}^{k_u}$, and $y \in \mathbb{R}^{k_y}$ are the system state, input, and output, respectively; $w \in \mathbb{R}^{k_x}$ and $v \in \mathbb{R}^{k_y}$ are unknown noises; $f$ stands for the fault amplitude; and $F_a$ and $F_s$ are system matrices corresponding to actuator and sensor faults, respectively. When the system works at a steady state, it can be described by Model II in (2), where, from a generalized point of view, $u$ can be regarded as one set of variables and $y$ as the other set [7]; $F$ is the fault matrix; $h$ is the so-called hidden variable; $P_1$ describes a linear relationship between $u$ and $y$; and $P_2$ is the linear operator projecting from the original data space to the feature space (also called latent variables) [61].
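For orientation, a common discrete-time state-space form that is consistent with the symbols defined for Model I is sketched below. This is an assumption rather than a reproduction of (1): the time index $k$ and the additive placement of the fault and noise terms are illustrative choices.

```latex
\begin{aligned}
x(k+1) &= A\,x(k) + B\,u(k) + F_a\,f(k) + w(k),\\
y(k)   &= C\,x(k) + D\,u(k) + F_s\,f(k) + v(k).
\end{aligned}
```

In this form, $F_a$ and $F_s$ fix the directions along which a fault enters the state and output equations, while $f(k)$ carries only its amplitude.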
2) Nonlinear System Models: For a nonlinear automation system, its dynamic model is given by Model III in (3), in which the state and output equations are defined by nonlinear operators. If the system (3) is degraded by actuator (or sensor) faults, the corresponding operator is replaced by its faulty counterpart, marked with the subscript a (or s). Considering the steady-state condition, (3) can also be written as Model IV in (4), where $P_1$ denotes the nonlinear relationship among variables, $P_2$ projects the original measurement data nonlinearly into the feature space [62], and $F$ is a nonlinear projection of $f$. It is worth noting that for data-driven FD designs of dynamic systems, the input-output (I/O) data model using stacked vectors and matrices is preferred in practice. It is fundamental for system identification-based FD methods [11]. More details can be found in surveys [7], [16], [46] and monographs [4], [17].

B. Revisiting Transfer Learning
Transfer learning, also known as knowledge transfer, has emerged as a new learning paradigm that deals with the same or similar tasks in the target domain by using knowledge learned from the source domain [63], [64], [65], [66]. A unique advantage of transfer learning is that each individual in the target domain benefits from sharing the common features and weakening or adjusting the different features [11], [60].
For the sake of simplicity, Table I summarizes the main notations used in transfer learning. The measurement space in the source domain is described by $Z_s = \{z \mid K_s\}$ and in the target domain by $Z_t = \{z \mid K_t\}$. Because of the difference between $K_s$ and $K_t$, the measurement space $Z_s$ is similar to but distinct from $Z_t$. To be specific, the sample size of $Z_s$ is normally sufficient to learn $K_s$, whereas the sample size of $Z_t$ is far smaller than that of $Z_s$.
As shown in Fig. 2, source data are different from target data, resulting in some common knowledge CK as well as some different knowledge DK. As a result, CK may be transferred and reused in the target domain [60]. For appropriately handling new tasks, some adjustments or renewed identification of DK (i.e., $K_t - K_s \cap K_t$) are also necessary.
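The split into CK and DK can be illustrated with a minimal sketch. It assumes, purely for illustration, that knowledge is a dictionary of named scalar parameters; the function name `split_knowledge` and the parameter names are hypothetical and do not come from the surveyed methods.

```python
def split_knowledge(k_source, k_target, tol=1e-9):
    """Split target knowledge into common (CK) and different (DK) parts.

    Knowledge is modeled here as a dict mapping parameter names to numbers,
    which is an illustrative simplification.
    """
    ck, dk = {}, {}
    for name, value in k_target.items():
        if name in k_source and abs(k_source[name] - value) <= tol:
            ck[name] = value  # unchanged: can be transferred and reused
        else:
            dk[name] = value  # changed or new: must be identified from target data
    return ck, dk


# Hypothetical source and target knowledge; only the time constant differs.
k_s = {"gain": 2.0, "time_constant": 0.5, "noise_var": 0.01}
k_t = {"gain": 2.0, "time_constant": 0.8, "noise_var": 0.01}
ck, dk = split_knowledge(k_s, k_t)
```

Only the entries in `dk` would have to be re-identified from the (small) target data set; the entries in `ck` are reused as they are.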
It should be emphasized that transfer learning is not a specific machine learning algorithm but a learning paradigm [51]. Its implementations can be based on, for example, neural networks [67], [68], [69], [70] and support vector machines [71], [72]. Therefore, we call them "transfer learning-motivated FD methods" in this survey.

C. Necessity of Transfer Learning
For designing FD approaches that are motivated by transfer learning, the definitions of residual signals and kernel representations are introduced as follows [73].
Definition 1: A vector $r(k)$ is called the residual signal if it can reflect the difference between the nominal state and the real response of Models I-IV. Mathematically, the tendency in (5) holds for a given $x(0)$ and $\forall u(k)$ if the system has no fault.
Definition 2: An operator $M$ is called a stable kernel representation of Models I-IV without noises if (6) holds for a given $x(0)$ and $\forall u(k)$ when there is no fault. For example, considering Model I, Wang et al. [74] adopted a left coprime factorization to parameterize $M$ and $r(k)$ satisfying (6).
Before starting this review, we should clarify when transfer learning is necessary in FD tasks and the principle by which it works. To sum up, there is often not enough information-rich target data for building a new model and performing a faithful FD task. By borrowing related samples or reusing the previous knowledge in the source domain, the transfer learning-motivated approaches can achieve better FD performance than those using only the target-domain data. In the following, we delineate several scenarios in which transfer learning is necessary.
1) Deviation between the pilot-scale platforms and actual systems.
2) Heterogeneity in similar automation systems.
3) Diversity among multiple work conditions, such as nonconstant operation points and time-varying loads.
4) Most importantly, a sample size in the target domain that is not enough for building a new reliable model.
These aspects also give us a hint of how to select both the source and target domains when we start a practical FD design and its application. It should be pointed out that the assumption hidden in transfer learning-based FD approaches is the existence of CK across domains.
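The behavior required of a residual signal can be illustrated with a minimal sketch. It assumes a scalar linear system with a Luenberger-type observer as the residual generator; the plant parameters, the observer gain, and the function name `simulate_residual` are all hypothetical choices made for illustration, not the construction used in [74].

```python
import math


def simulate_residual(n_steps, fault_start=None, fault_size=1.0):
    """Return the residual sequence r(k) of a scalar observer-based generator."""
    a, b, c = 0.9, 1.0, 1.0  # assumed plant parameters
    gain = 0.5               # assumed observer gain; a - gain * c = 0.4 is stable
    x, x_hat = 1.0, 0.0      # true and estimated initial states differ on purpose
    residuals = []
    for k in range(n_steps):
        u = math.sin(0.1 * k)  # arbitrary input signal
        faulty = fault_start is not None and k >= fault_start
        f = fault_size if faulty else 0.0
        y = c * x + f                         # measurement with additive sensor fault
        r = y - c * x_hat                     # residual signal r(k)
        residuals.append(r)
        x_hat = a * x_hat + b * u + gain * r  # observer update
        x = a * x + b * u                     # plant update
    return residuals


r_normal = simulate_residual(60)
r_faulty = simulate_residual(60, fault_start=30)
```

Without a fault, the residual decays to zero regardless of the input and the initial state mismatch; with the sensor fault, it settles at a nonzero value.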

D. Developments of Transfer Learning-Based FD
Combining with knowledge transfer shown in Fig. 2, a review of the latest development of transfer learning-motivated FD approaches is given subsequently.
With the aid of a deep convolutional neural network, Kumar and Hati [75] achieved knowledge transfer and, based on it, proposed a fault detection scheme for squirrel cage induction motors. By adopting a fully connected neural network, Yang et al. [76] designed a state information prediction method for multiple-input-multiple-output systems, which can cope with a small number of labeled data in the target domain. In order to mitigate the influence caused by the lack of fault labels in training, neural networks were employed to develop a transfer learning-based fault detection approach in [77] to enhance the FD performance. Furthermore, transfer learning was extended to prognostics and health management tasks in [78]. For a dynamic rolling system, Dong et al. [79] proposed an improved FD method based on parameter transfer, where the small sample problem was also taken into account. Knowledge transfer is usually adopted to address multiple tasks; for instance, Hasan et al. [80] proposed a transfer learning-based FD strategy by incorporating higher order spectral analysis. In addition, it can also be used for fault classification [81] if the faulty data are collected in advance. In [82], partial valuable knowledge was transferred into the target domain, based on which a weighted adversarial network was designed to fulfill FD tasks. By using a hierarchical structure of convolutional neural networks, a novel transfer learning-based FD approach was developed in [83] with consideration of different fault modes. Most recently, Li et al. [84] proposed a transfer learning approach through a generative adversarial network, which showed satisfactory FD performance under imbalanced data conditions.
As illustrated in [1], supervised and unsupervised learning-based FD methods can be transformed into each other, so that the same or similar FD performance can be achieved. Therefore, unsupervised learning-based FD methods have also been widely applied. Following the concept defined in [1], Zhao et al. [85] summarized the applications of unsupervised neural networks to transfer learning-based FD, and Cheng et al. [70] achieved knowledge transfer and reuse based on a variational autoencoder. In the past two years, interest in applying unsupervised learning techniques to knowledge transfer can also be found in [86], [87], [88], [89], [90], [91], and [92], which attests to the success of unsupervised learning applied in FD tasks. Certainly, the architecture of deep neural networks affects the learning ability, as well as knowledge extraction and transfer. Therefore, Sharma and Verma [93] investigated the effects of both the depth and width of neural networks on FD performance where transfer learning is present.
Up to now, there are some articles dedicated to presenting the state of the art of transfer learning-motivated FD methods, such as the survey papers [60], [94], [95]. Looking back at the development of transfer learning-motivated FD approaches, it is not difficult to find an intimate relationship between deep learning and transfer learning.
In practical FD applications, data used for system modeling or feature extraction have a limited sample size, making both the prediction error and the obtained FD results strongly dependent on system inputs [96]. In addition to quantifying the prediction error through probability inequalities [96], transfer learning is also popular in FD tasks when the sample size in the target domain is not large enough. Therefore, few-shot learning [97], [98] has also been widely adopted in FD tasks. For instance, Wu et al. [99] utilized meta-learning for addressing few-shot samples and then proposed a transfer learning-based FD approach. With the aid of a dual graph neural network, Wang et al. [100] achieved FD tasks for intelligent manufacturing systems by using few-shot learning, where useful features are learned from the original images and are then transferred into the target domain. It is worth mentioning that both few-shot learning and zero-shot learning also need accurate knowledge, which will be summarized in Section III. This knowledge must be known a priori or learned from sufficient data samples. For example, the model-based FD method could be regarded as zero-shot learning because the system information is given and can be used directly in FD designs.

III. OVERVIEW OF TRANSFER LEARNING-MOTIVATED FD METHODS: THREE PRINCIPLES
This section systematically reviews the basic knowledge of transfer learning-motivated FD methods via two categories, with a primary emphasis on three principles. These bases are theoretically instrumental for designing FD methods with the aid of transfer learning.

A. Overview: Three Principles
Depending on the manner of utilizing knowledge across domains, all transfer learning-motivated approaches can be divided into two categories: knowledge calibration-based methods and knowledge compromise-based methods. Schematically, a refined classification of transfer learning-motivated FD approaches is shown in Fig. 3, where the left principal line is developed based on changed knowledge from the source to target domains, while the right one is to compromise the difference across domains.
Regarding these categories mentioned above, their schematic descriptions are presented in Figs. 4 and 5. To be more specific, the following aspects outline the significant differences between knowledge calibration and compromise.
1) Knowledge Utilization: This is the first distinction, since different parts of knowledge are involved in the two subclasses of methods. As shown in Fig. 5, the new knowledge used for FD in the target domain consists of $K_s$ and $K_t$ (i.e., $K_s \cup K_t$). Therefore, a compromise between $K_s$ and $K_t$ is achieved via federated learning or weights [94]. By transferring the common knowledge and identifying the changed part, Fig. 4 sketches the core idea behind knowledge calibration-based methods. Specifically, the new knowledge in the target domain becomes $CK \cup DK$, where $DK = K_t - K_s \cap K_t$ is the changed knowledge that needs to be identified from target data.
2) Adjustment Position: Different adjustment positions arise from the distinct working mechanisms of the two subclasses of methods. For knowledge calibration-based FD methods, the new knowledge is used for reformulating the residual generation so that the obtained one can work well for data in the target domain. For the knowledge compromise counterparts, after the integration of knowledge across domains, the decision logic is adjusted, making the test statistic suitable for both the source and target events. Specifically, this kind of FD method attempts to find an enlarged boundary of test statistics so that, in fault-free conditions, the chosen threshold is reliable and reasonable.
3) Weight Assignment: Compared to knowledge calibration, knowledge compromise relies more on weight allocation between the knowledge of the two domains. For the knowledge calibration-based FD methods, the weights assigned to $K_s$ and $K_t$ must be 0 and 1, respectively. However, the balance/tradeoff between $K_s$ and $K_t$ plays an essential role in knowledge compromise-based FD methods, so that a proper assignment is pursued as an objective function.
4) FD Performance: Many criteria can be adopted to evaluate the obtained performance of FD approaches. For example, the fault detection ratio (FDR) and false alarm ratio (FAR) are two interacting indices that should be considered together. Knowledge calibration-based designs consider both, thus achieving satisfactory performance. Unfortunately, the knowledge compromise-based methods can only reduce FARs by enlarging the boundary to accommodate the normal data in both the source and target domains. An accompanying problem is the poor/reduced FD power.
Based on these insights and summaries, a brief comparison between the two subclasses of methods is elaborated in Table II, where multiple elements are considered.
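The interaction between the two indices in item 4) can be made concrete with a small sketch. It assumes that FDR is read as the fraction of faulty samples that raise an alarm and FAR as the fraction of fault-free samples that raise one; the function name `far_and_fdr` and the data are hypothetical.

```python
def far_and_fdr(alarms, faulty):
    """Compute (FAR, FDR) from per-sample alarm flags and ground-truth fault flags."""
    normal_total = sum(1 for f in faulty if not f)
    fault_total = sum(1 for f in faulty if f)
    false_alarms = sum(1 for a, f in zip(alarms, faulty) if a and not f)
    detections = sum(1 for a, f in zip(alarms, faulty) if a and f)
    far = false_alarms / normal_total if normal_total else 0.0
    fdr = detections / fault_total if fault_total else 0.0
    return far, fdr


# Hypothetical test statistics: the first four samples are fault-free.
statistic = [0.2, 0.4, 1.1, 0.3, 2.5, 1.8, 0.9, 3.0]
faulty = [False, False, False, False, True, True, True, True]

tight = far_and_fdr([s > 1.0 for s in statistic], faulty)  # original boundary
loose = far_and_fdr([s > 2.0 for s in statistic], faulty)  # enlarged boundary
```

With these numbers, enlarging the boundary removes the false alarm but also loses one detection, which is the reduced FD power noted above.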
In order to reasonably design a faithful FD approach motivated by transfer learning, three principles should be kept in mind: 1) choose a system model that implies the complete system knowledge; 2) determine the changed knowledge that needs to be adjusted; and 3) define evaluation criteria so that the final FD performance can be evaluated. In Sections III-B and III-C, these three instrumental and valuable principles will be detailed under the two general frameworks.

B. Knowledge Calibration-Based Methods
1) Choices of a System Model: Depending on the dynamic behavior or relationship among variables, an automation system can be modeled in different forms. Model selection is an important part of transfer learning and FD because it directly affects FD performance. For example, the transfer function provides an alternative description of (1). In the following, Models I-IV, which are presented in Section II, will be chosen as the mathematical model to describe an automation system.
Denote $L(\cdot)$ as the probability density function. If a system is linear time-variant with significant dynamics, it can be described by Model I, as given in (1). Then, its knowledge can be denoted as in (7). At steady operation, the knowledge of Model II becomes (8). In addition, for a nonlinear system, its dynamics should be modeled by Model III, as presented in (3). The corresponding knowledge is given in (9). Similarly, the knowledge of the nonlinear Model IV is given in (10). A cautious check of which model the system belongs to is crucial and should be done before moving on to the design of the knowledge calibration-based FD approaches. After that, the second step is to discover and identify the inconsistencies between source and target domains, as introduced in the following.
2) Determination and Identification of Which Knowledge Has Changed: Once an appropriate system model is determined, knowledge information, including system parameters and data distributions, can be uniformly described by (7)-(10). For two different systems with a similar operation mechanism, or one system under different working conditions, knowledge calibration can be achieved after a successful identification of the differences by using target data, i.e., $DK = K_t - K_s \cap K_t$. At the same time, the common knowledge CK is transferred from the source to target domains.
In order to clearly show the transferring and adjusting/identifying that are involved in transfer learning, Table III presents several typical examples regarding the four system models mentioned above. This learning paradigm performs well in the presence of changes but leaves the known or preestablished knowledge untouched. In general, identifying only the changed knowledge is easier to implement and has a more intuitive physical interpretation.
When any change in system parameters or data distributions occurs, we must change the stable kernel representation $M$ given in Definition 2 because the original $M$ will yield nonzero residual signals in fault- and noise-free conditions. Combining with Table III, we detail the necessity of transfer learning in the following, as well as the difference of knowledge in the two domains, through the residual signals.
In the source domain, it follows from (5) and (6) that (11) holds, where $E(\cdot)$ is the expectation operator. If there is a changed parameter in Models I-IV, (11) in the target domain becomes (12). Since the original $M$ then invalidates the predesigned residual generators, we have (13).

TABLE III EXAMPLES OF WHICH KNOWLEDGE CHANGED AND NEEDING TO IDENTIFY IN KNOWLEDGE CALIBRATION-BASED FD METHODS
At the same time, if $K_s$ is used in the target domain, one obtains (14). Similarly, considering the changes of data distribution, (11) in the target domain becomes (15). It is easy to find that (16) and (17) hold. It is worth remarking that, for changes arising in both the system parameters and data distributions, (12)-(17) indicate that (18) holds. In order to handle the nonzero residuals in (18), the relationship between $z$ and $M$ should be updated by transferring unchanged knowledge and, meanwhile, identifying the changed counterpart.
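The effect described above can be reproduced on a toy example. The sketch assumes a noise-free static model $y = p\,u$ in which only the gain $p$ differs between domains; the gains, the data, and the helper names `identify_gain` and `mean_abs_residual` are hypothetical and chosen only for illustration.

```python
def identify_gain(inputs, outputs):
    """Least-squares estimate of p in the static model y = p * u."""
    num = sum(u * y for u, y in zip(inputs, outputs))
    den = sum(u * u for u in inputs)
    return num / den


def mean_abs_residual(gain, inputs, outputs):
    """Average magnitude of the residual r = y - gain * u."""
    return sum(abs(y - gain * u) for u, y in zip(inputs, outputs)) / len(inputs)


p_source, p_target = 2.0, 2.5    # hypothetical gains in the two domains
u_target = [1.0, 2.0, 3.0, 4.0]  # a few fault-free target samples
y_target = [p_target * u for u in u_target]

# Reusing source knowledge unchanged leaves a nonzero fault-free residual.
bias_with_source = mean_abs_residual(p_source, u_target, y_target)

# Calibration: re-identify only the changed parameter from the target data.
p_calibrated = identify_gain(u_target, y_target)
bias_after_calibration = mean_abs_residual(p_calibrated, u_target, y_target)
```

After the changed gain is re-identified from four target samples, the fault-free residual returns to zero, while everything else in the residual generator is reused.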
3) Evaluation Criteria: The evaluation performance considers two aspects, i.e., fault detection and FD.
a) Fault detection: Generally, the hypothesis test is adopted for making a fault decision [31], which is also suitable for transfer learning-motivated FD. Let $H_0$ and $H_1$ be the null hypothesis and alternative hypothesis, respectively. Then, the fault detection problem in the presence of transfer learning can be formulated as in (19), where $s$ is a chosen function, which can be called the test statistic.
Therefore, with the aid of the knowledge CK and $DK = K_t - K_s \cap K_t$, the objective function of knowledge calibration-based approaches can be uniformly formulated as in (20), where Prob$\{\cdot\}$ is the probability, $J_{th}$ is the threshold of $s(r_z)$, and $\alpha$ is the given significance level (numerically equal to the FAR). Improving fault detectability can be regarded as an optimization problem, namely, how to improve the successful detection rate under a given $\alpha$. In (19) and (20), the test statistic $s(r_z)$ is described in a collective form. In practical applications, its choice usually depends on the specific purpose. For example, some of the commonly used test statistics are Hotelling's $T^2$ [4], squared prediction error [15], Hellinger distance [101], 2-norm [102], Kullback-Leibler divergence [103], entropy [104], mutual information [105], trace [106], and so on [7].
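How a threshold can be tied to the significance level is sketched below. The sketch assumes that $J_{th}$ is taken as an empirical quantile of the test statistic over fault-free data, which is one common practical choice and not necessarily the construction behind (20); the Gaussian residuals, the squared-residual statistic, and the function name `empirical_threshold` are illustrative assumptions.

```python
import random


def empirical_threshold(statistics, alpha):
    """Pick J_th so that at most a fraction alpha of the samples exceed it."""
    ordered = sorted(statistics)
    n_allowed = int(alpha * len(ordered))  # samples permitted above J_th
    return ordered[len(ordered) - n_allowed - 1]


random.seed(0)
residuals = [random.gauss(0.0, 1.0) for _ in range(1000)]  # fault-free residuals
stats = [r * r for r in residuals]                         # s(r) as squared residual
alpha = 0.05
j_th = empirical_threshold(stats, alpha)

# Empirical false alarm ratio on the same fault-free data.
far = sum(s > j_th for s in stats) / len(stats)
```

By construction, the empirical FAR on the fault-free data does not exceed the chosen $\alpha$.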
It is recommended that the (generalized) likelihood ratio be used to formulate $s(r_z)$ because of its optimality in fault detection [4]. In the framework of the transfer learning-motivated FD approaches, defining the log-likelihood ratio on $s(r_z)$ yields (21). It leads to the expression in (22), which is consistent with (19).
For any test statistic, the ultimate objective of knowledge calibration used for fault detection can be described by (23), where the drift of $J_{th}$ is small and is only related to the unknown noise and the identification error.
b) Fault diagnosis: After fault detection, diagnosing a fault is of practical interest. Over the past decades, the reconstruction-, algebraic-, structured residual set-, and fixed direction-based FD techniques have been popular among conventional FD methods [33]. In fact, knowledge calibration used in fault detection procedures has sufficiently considered the difference between the source and target domains, i.e., the effects in residual signals caused by changed knowledge have been removed as much as possible. It, therefore, allows us to use the well-established FD techniques if and only if performing fault detection tasks follows the aforementioned three principles. In Section III-C, a distinct fault detection idea using knowledge compromise will be introduced, based on which the direct use of these FD techniques will not make much sense.
Without loss of generality, the faults of interest in this review are assumed to be isolatable (i.e., the rank of fault matrices is the same as the number of faults under consideration). It is worth noting that observability is not a sufficient condition for fault isolability [32].
In order to briefly introduce the basic idea of fault isolation, a fixed direction strategy is adopted in this survey, whose foundation is control theory [107]. Through the direction concept, an additional purpose is to clarify that, for an automation system, FD is different from the classification problem [7]. Consider a system that can be described by Models I-IV, and define the set of all faults as $\{F_1, \ldots, F_i, \ldots, F_m\}$, where $i$ is the index of the fault category. Then, a directional residual vector $r_{t,i}$ can be expressed in terms of the signature direction $\xi_i$, i.e., $r_{t,i} = \beta_i \xi_i$, where $\beta_i$ is a scalar whose value depends on both the system dynamics and the fault amplitude, and $\xi_i$ is uniquely related to the fault matrix/direction $F_i$. As intuitively shown in Fig. 6, the direction of $r_t$ is closest to $\xi_3$, indicating that the fault being detected is most likely $f_3$. Instead of relying on observations, certain mathematical indices are more useful in practice. For example, the angle $\measuredangle$ between the signature direction and $r_t$ can be chosen as an effective measure to achieve reliable isolation of a fault, i.e., the fault whose signature direction forms the smallest angle with $r_t$ is selected.
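The angle-based isolation rule can be sketched in a few lines. The signature directions, the residual vectors, and the function names below are hypothetical; taking the absolute value of the inner product, so that a negative $\beta_i$ still matches its direction, is a design choice made for this illustration.

```python
import math


def angle_between(v, w):
    """Angle between the lines spanned by v and w (the sign of beta_i is ignored)."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    cos = max(-1.0, min(1.0, abs(dot) / (norm_v * norm_w)))
    return math.acos(cos)


def isolate_fault(residual, signatures):
    """Return the index of the signature direction closest to the residual."""
    angles = [angle_between(residual, xi) for xi in signatures]
    return min(range(len(angles)), key=angles.__getitem__), angles


# Hypothetical signature directions xi_1, xi_2, xi_3.
signatures = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
r_small = (0.05, -0.02, 0.6)
r_large = (0.5, -0.2, 6.0)  # same direction, ten times the amplitude

idx_small, _ = isolate_fault(r_small, signatures)
idx_large, _ = isolate_fault(r_large, signatures)
```

Both residuals are assigned to the third signature direction, which illustrates that the isolation result depends on the fault direction and not on the fault amplitude.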

C. Knowledge Compromise-Based Methods
Actually, the implementation of knowledge compromisebased FD approaches also obeys three principles, as summarized in Section III-A.The main difference between knowledge calibration and compromise lies in the working mechanism, including identifying, transferring, and reusing the knowledge across domains.Later in this section, these distinctions will be emphasized via both mathematical expressions and physical explanations.
1) Choices of a System Model: An automation system can be described by one of the Models I-IV, as respectively, shown in ( 1)-( 4).Similar to knowledge calibration, its corresponding knowledge is given in ( 7)- (10).Choosing a proper system model is still the first necessary step.
2) Identification of New Knowledge: In this step, the knowledge needed to identify is different from that in knowledge calibration-based FD approaches since these knowledge compromise-based counterparts attempt to learn all knowledge involved (i.e., K s K t ) rather than the changed part DK.Therefore, this kind of method is not concerned with which part of knowledge is being changed.
Table IV shows several examples of the knowledge to be identified in knowledge compromise-based FD methods for the four system models. To sum up, this kind of method transfers $K_s$ from the source domain to the target domain and relearns the knowledge $K_t$. It can be implemented even if a priori knowledge is absent.
Of course, any changes in parameters and signal distributions will also cause unexpected deviations when the knowledge compromise-based methods are considered [94], as explained in (11) and (14). This leads to the same conclusion as presented in (18).
To achieve reliable FD, the final step, reformulating the evaluation criteria with the aid of knowledge compromise, is examined based on $K_s$ and $K_t$.
3) Evaluation Criteria: The emphasis will be on fault detection based on knowledge compromise.
a) Fault detection: Differing from (20), the fault detection problem using the compromised knowledge can be uniformly formulated as follows: Here, the compromised knowledge of $K_s$ and $K_t$ is just a general description, which can be mathematically described by where $\lambda \in (0, 1)$ is a tuning factor. By the use of the compromised knowledge, the objective function of this class of FD methods becomes Similar to (21), the generalized likelihood ratio can be defined on $s(r_z(k))$, i.e.,

TABLE IV EXAMPLES OF KNOWLEDGE NEEDING TO BE IDENTIFIED IN KNOWLEDGE COMPROMISE-BASED FD METHODS
which also yields (20). Then, (26) can be rewritten as In this case, the final objective of using the so-called compromised knowledge is where $J_{th}$ is a significant drift, based on which the boundary $J_{th,t}$ will be enlarged.
b) Fault diagnosis: As presented above, this subclass of methods tries to enlarge the boundary used for fault detection. Its implementation employs a tradeoff between $K_s$ and $K_t$. As an inevitable result, the fault feature $\xi_i$ of $F_i$ reflected in $r_t$ changes unpredictably. Therefore, extending well-developed methods, such as algebraic-based algorithms [32] and directional residual vectors [107], to the knowledge compromise-based FD methods is difficult, if not impossible. The essence of this problem is that $(K_s - CK)$ is involved in formulating and generating residual signals in the target domain. FD using the compromised knowledge is still a challenging topic despite continuous efforts in both the automation and mechanical fields. Further discussions are provided in Section IV-B.

IV. GENERALIZED INSIGHTS AND PERSPECTIVES
Although transfer learning is a new learning paradigm in the field of machine learning, its embryonic forms can be found in a large body of FD literature. Therefore, by reviewing some related FD methods, this section establishes several bridges between transfer learning-motivated and traditional FD methods from a generalized viewpoint. Motivated by the considerable challenges ahead, research trends on this topic are also predicted.

A. Bridges
Because they are simple to understand and implement, knowledge compromise-based strategies have produced the most fruitful methods [108], [109], [110] under the transfer learning framework. Regarding the knowledge compromise-based FD strategies, theoretical explanations and practical applications are presented in the surveys [94], [111]. By examining the nature of these existing methods, several bridges are established in the following, linking them to traditional FD methods.
1) Multirate Approaches: Automation systems are usually equipped with various types of sensors due to practical requirements such as real-time control and monitoring [7]. However, the accompanying problem is an irregular sampling rate, i.e., a mixture of fast- and slow-rate measurements [112]. This has naturally catalyzed the development of multirate FD methods [112], [113], [114].
It can be found from [14], [115], [116] that transfer learning has been widely adopted to address this problem. The principle shared by traditional FD methods [112], [113], [114] and transfer learning-based FD methods [14], [115], [116] is that, by transferring knowledge across domains, the obtained knowledge becomes suitable for all sampling rates (sometimes called original data).
2) Multimode Approaches: Multimode operation is a frequent characteristic of automation systems such as chemical processes [12], [117], [118], [119] and electrical drive systems [120]. In [118], the unchanged knowledge among different modes is obtained via subspace decomposition. From the viewpoint of transfer learning, it can also be called common knowledge CK. This part of unchanged knowledge is transferred across modes because different modes can be regarded as different domains in transfer learning.
For a multimode system, a general model covering a wide spectrum of modes becomes a favorite choice. For example, Zhang et al. [121] developed a model migration-based FD approach in which common information/knowledge among different modes can be shared and transferred. In this way, it provides an exploratory attempt toward transfer learning-based FD methods. More recently, a transfer learning-motivated FD method was developed in [110] to deal with the case in which no faulty samples are available in some specific modes.
3) Multitask Approaches: Essentially, multitask learning aims to obtain a model with a generalized ability from a single task. Such an idea can be easily applied in FD research, especially for an automation system that tends to have a higher degree of integration. Following this idea, Zhang et al. [122] employed a multitask FD system in which the knowledge can be shared in a deep neural network. By transferring CK across multiple tasks, an end-to-end model is developed in [123]. In fact, the idea behind the multitask approaches is transfer learning.

B. Challenges
1) Fault Diagnosis: After the successful detection of faults, the ultimate goal is to diagnose and isolate them. The traditional FD methods remain valid in knowledge calibration-based approaches. However, performing FD tasks using knowledge compromise-based schemes becomes extraordinarily difficult.
The main reason is that no accurate models (such as the signature direction $\xi_i$) are available for target data, so that both isolating and identifying faults are impractical. As mentioned in Section III, the reduced fault detection power also contributes to such a failure. From this viewpoint, the knowledge compromise-based schemes are a double-edged sword since their applicability is achieved at the cost of FD performance. Therefore, knowledge compromise-based FD deserves increasing attention in the coming years.
2) Valid Generalization: As mentioned in Section II, transfer learning is recommended only when sufficient data are not available in the target domain. This poses more difficulties in identifying the changed knowledge, especially the changed part DK in Models III and IV. For example, it is necessary to choose some nonlinear operators when considering nonlinear systems. A series of inevitable questions then arises: How can the degree of nonlinearity of the changed knowledge be determined? How much data is necessary to ensure a valid generalization? How well can the developed method make a faithful FD decision when target data differ from source data? Considering empirical FARs and FDRs, the critical issue in transfer learning-motivated FD methods is the generalization ability.
To learn and identify the changed knowledge, one usually suffers from the overfitting or underfitting problem. A transfer learning method that is not sufficiently complex will fail to learn the complex knowledge that varies across different domains, leading to underfitting. However, overly complex methods may partially fit the noise, resulting in overfitting [124]. In transfer learning, underfitting produces a biased estimate, whereas overfitting generates excessive variance. Generally speaking, overfitting is more dangerous because the obtained knowledge can then deviate far from the true knowledge [125].
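The tradeoff can be made concrete with a toy regression, given here only as a hedged illustration. The sine-shaped "changed knowledge", the noise level, the sample sizes, and the helper `fit_errors` are all hypothetical choices, and polynomial fitting merely stands in for any transfer learning method of adjustable complexity. The training error cannot increase with the polynomial degree, whereas the error on held-out data carries no such guarantee.

```python
import numpy as np
from numpy.polynomial import Polynomial

def fit_errors(degree, x_train, y_train, x_test, y_test):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((model(x_train) - y_train) ** 2))
    test_mse = float(np.mean((model(x_test) - y_test) ** 2))
    return train_mse, test_mse

def truth(x):
    """Hypothetical nonlinear knowledge to be identified."""
    return np.sin(2.0 * np.pi * x)

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 8)            # only a few target-domain samples
y_train = truth(x_train) + 0.1 * rng.standard_normal(x_train.size)
x_test = np.linspace(0.0, 1.0, 200)           # noise-free held-out grid
y_test = truth(x_test)

errors = {deg: fit_errors(deg, x_train, y_train, x_test, y_test)
          for deg in (1, 3, 7)}
for deg, (train_mse, test_mse) in errors.items():
    print(f"degree {deg}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The degree-1 model cannot represent the sine and is therefore biased regardless of the data, while the degree-7 model interpolates all eight noisy samples, so its training error says nothing about how well it generalizes.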
3) Zero-Shot Learning: Zero-shot learning is a new technique coined by Larochelle et al. [126]. Its objective is to judge whether a sample in the online phase belongs to one of the known classes or to a new class that has never been observed before. Such a problem has been widely studied in the field of machine learning. As can be observed, it is usually accompanied by transfer learning [79].
As pointed out in [7], FD belongs to a specific supervised learning method whose precondition is the independence between the result and fault amplitudes. As a result, training sets are labeled to mine and identify the fault directions. When an online fault occurs whose type differs from the known fault classes, the predesigned FD algorithm cannot recognize it correctly, resulting in a wrong decision. In this case, the fault library is incomplete and needs an online update. Zero-shot learning may provide a solution to this issue. In fact, ensuring the generalization ability, feasibility, and explainability of FD approaches using zero-shot learning is more important than a simple implementation. This would be impossible in the absence of control theory.

C. Perspectives
Based on the discussions above, we present potential research topics in the following.
1) Knowledge Migration: Recalling its definition in [127], transfer learning intends to store previous knowledge and reuse it in a new task. In [7], this is called knowledge or model migration. Based on the necessities in Section II, we delineate three classes of FD methods from this viewpoint. They will be of great interest in practical FD applications and are summarized as follows: 1) knowledge migration among similar application objectives; 2) knowledge migration from laboratory or experimental conditions to actual applications; and 3) knowledge migration from one FD method to another similar FD method. For example, inductive learning and deductive learning may be powerful strategies that make knowledge migration possible and easy.
2) Online Transfer: A system model based on first principles is popular in actual engineering applications, even if it may not be accurate. For example, a traction system can be modeled by state-space equations under the assumption that the magnetic circuit is linear [7]. However, numerous factors, such as unknown external disturbances and the coil temperature, unavoidably weaken the applicability of the original system models. Hence, the best remedy is online knowledge transfer, which can adjust the system model and optimize system performance in real time.
On this point, there are at least two promising solutions. The first is recursive FD algorithms that can update system knowledge and the decision logic in real time. The second is the plug-and-play technique [128], based on which online optimization can be achieved without modifying the original system structure and controller.
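As one concrete instance of a recursive update, the sketch below uses standard recursive least squares with a forgetting factor. It is a swapped-in textbook technique, not an algorithm taken from the surveyed works, and the class name, regressors, parameter values, and the 2% parameter drop at step 1000 are hypothetical.

```python
import numpy as np

class RecursiveLeastSquares:
    """Recursive least squares with exponential forgetting for y = phi^T theta."""

    def __init__(self, n_params, forgetting=0.98, p0=1e3):
        self.theta = np.zeros(n_params)      # current parameter estimate
        self.P = p0 * np.eye(n_params)       # scaled inverse information matrix
        self.lam = forgetting                # 0 < lam <= 1; smaller forgets faster

    def update(self, phi, y):
        """Process one sample and return the one-step prediction residual."""
        phi = np.asarray(phi, dtype=float)
        gain = self.P @ phi / (self.lam + phi @ self.P @ phi)
        residual = y - phi @ self.theta
        self.theta = self.theta + gain * residual
        self.P = (self.P - np.outer(gain, phi) @ self.P) / self.lam
        return residual

rls = RecursiveLeastSquares(n_params=2)
theta_true = np.array([1.0, 0.5])
for k in range(2000):
    if k == 1000:
        theta_true = np.array([0.98, 0.5])   # hypothetical 2% drop in one parameter
    phi = np.array([np.sin(0.1 * k), np.cos(0.3 * k)])
    rls.update(phi, phi @ theta_true)        # noise-free measurement for clarity

print("estimate after the change:", np.round(rls.theta, 4))
```

The residual returned by `update` could feed a detection test, while the forgetting factor sets how quickly outdated source-domain data lose influence on the estimate.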
3) Negative Transfer: In practical applications, a common phenomenon is that the data labels in both the source and target domains may be lost or mismarked. In this case, a fundamental task is to correctly and effectively borrow samples or reuse knowledge from the source domain. Therefore, the source data can be divided into two classes, i.e., positive and negative samples. Correspondingly, the reused knowledge from the source domain is called either positive or negative knowledge, depending on whether it improves or degrades the FD performance in the target domain.
Generally, negative transfer is a learning behavior that cannot be avoided. However, it motivates us to improve the FD performance by minimizing the effects caused by negative knowledge. More related investigations are expected to emerge in the future.
4) Shrinkage Estimator-Aided Transfer: As specifically mentioned in Section II-C, the sample size in the target domain is not enough for identifying all the knowledge. It also indicates that a new model contains more knowledge than the changed part. Therefore, the objective of transfer learning can be reformulated as follows: given a few data samples, identify the changed knowledge as accurately as possible. Based on the weak law of large numbers or some probability inequalities [96], for a fixed number of samples, identifying fewer parameters can actually yield better performance.
These facts motivate us to learn the changed knowledge by using a shrinkage operator in the design of transfer learning-motivated FD algorithms. For example, the least absolute shrinkage and selection operator [129] is well suited to such a problem in linear regression, due to its ability to enhance prediction accuracy and improve generalization by constraining some parameters to be zero. Beyond that, other linear and nonlinear shrinkage estimators (such as ridge regression) can also be adopted in transfer learning-based FD methods. Such a topic will be of great interest in FD research using transfer learning.
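A minimal sketch of this idea is given below, assuming a linear regression setting in which only one of ten parameters changes between domains and only six target samples are available. The lasso is solved by plain iterative soft thresholding (ISTA), chosen here for self-containment. The data, the regularization weight `alpha`, and the names `soft_threshold` and `lasso_ista` are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of the L1 norm: shrink each entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, r, alpha, n_iter=10000):
    """Minimise (1/2n)*||r - X d||^2 + alpha*||d||_1 by iterative soft thresholding."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2     # 1 / Lipschitz constant of the gradient
    d = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ d - r) / n
        d = soft_threshold(d - step * grad, step * alpha)
    return d

rng = np.random.default_rng(0)
n_samples, n_params = 6, 10                  # fewer target samples than parameters
X = rng.standard_normal((n_samples, n_params))
X /= np.linalg.norm(X, axis=0)               # unit-norm regressors

theta_s = np.ones(n_params)                  # hypothetical source-domain parameters
delta_true = np.zeros(n_params)
delta_true[3] = -0.5                         # only one parameter has changed
y_t = X @ (theta_s + delta_true)             # noise-free target-domain outputs

# Learn only the change, using the source parameters as the starting point.
delta_hat = lasso_ista(X, y_t - X @ theta_s, alpha=0.01)
theta_t_hat = theta_s + delta_hat
print("estimated change:", np.round(delta_hat, 3))
```

Estimating all ten target parameters directly from six samples would be underdetermined, whereas penalizing the change itself exploits the assumption that most of the source knowledge remains valid. The shrinkage also biases the nonzero entry slightly toward zero.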
5) Entropy-Based Transfer: Mathematically, entropy is a metric that measures uncertainty or randomness. For example, the Shannon entropy can be defined on discrete signals, serving as a powerful tool in information theory [130]. Recalling the two objective functions in (19) and (29), we can rewrite them in entropy-based forms such as the Kullback-Leibler divergence to perform FD tasks [103]. Entropy is referenced and employed more frequently than before in recent FD studies, such as [131] and [132]. Certainly, the objective function of transfer learning-motivated FD methods could be defined by entropy. For instance, entropy can be used to evaluate the uncertainties and performance of the designed residual generator. It can be easily verified that the residual signals using knowledge calibration have smaller uncertainties than those obtained via knowledge compromise. Therefore, considerable effort could be devoted to entropy-based knowledge learning and transfer in future FD research.
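For scalar Gaussian residuals, both quantities have closed forms, which the following sketch evaluates. The Gaussian assumption, the use of differential entropy, and the numerical values are illustrative choices made here and are not taken from (19) or (29).

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy (nats) of a scalar Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def gaussian_kl(mu0, sigma0, mu1, sigma1):
    """KL divergence KL(N(mu0, sigma0^2) || N(mu1, sigma1^2)) in nats."""
    return (math.log(sigma1 / sigma0)
            + (sigma0 ** 2 + (mu0 - mu1) ** 2) / (2.0 * sigma1 ** 2)
            - 0.5)

# Hypothetical residual spreads: a tight residual versus a more tolerant one.
sigma_tight, sigma_wide = 1.0, 1.5
print("entropy (tight):", round(gaussian_entropy(sigma_tight), 4))
print("entropy (wide): ", round(gaussian_entropy(sigma_wide), 4))

# The same mean shift (a hypothetical fault) is less distinguishable from the
# fault-free residual distribution when that distribution is wider.
fault_shift = 1.0
print("KL, tight:", round(gaussian_kl(fault_shift, sigma_tight, 0.0, sigma_tight), 4))
print("KL, wide: ", round(gaussian_kl(fault_shift, sigma_wide, 0.0, sigma_wide), 4))
```

In this toy setting, a residual with larger spread has higher entropy, and a fixed fault-induced shift yields a smaller divergence from the fault-free distribution, which matches the qualitative claim about calibration versus compromise.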
Apart from the aforementioned potential research directions, more attention can be paid to system properties when employing transfer learning in FD tasks. A partial list of typical automation systems is summarized in Fig. 7.

V. REAL-WORLD APPLICATION
In order to understand the necessity and implementation of transfer learning-motivated FD approaches, a real-world application is considered in this section, where both knowledge calibration and compromise are involved.

Fig. 8. Performance degradation in traction systems of high-speed trains [31].

TABLE V MAIN PARAMETERS OF TRACTION SYSTEMS [7]

A. Traction Systems With Performance Degradation
As time goes on, high-speed trains in active service suffer from performance degradation. For example, aging traction motors are shown in Fig. 8; they result in performance degradation of the whole traction system. For the traction systems shown in Fig. 8, additional parameters, including the degradation degree of actuator performance and fault amplitudes, are summarized in Table V.

B. Knowledge Transfer for FD
1) Knowledge Learning in the Source Domain: There are two general ways to learn source domain-specific knowledge: 1) building the model of the traction system according to first principles and 2) learning an equivalent model using input and output data. Define two stator voltages and stator currents as the system inputs and outputs, respectively. Then, the obtained knowledge, corresponding to Model I, is $K_s = \{A, B, C, D, L(w), L(v)\}$. (32) By the use of $K_s$, the normal region can be obtained when the operating point of the traction system is given. As shown in Fig. 9, two normal regions for $y_1$ and $y_2$ can be obtained.
2) Knowledge Calibration: Consider the actuator performance degradation $\Delta B = 2\% B$, which starts at the 1001st step. Then, we can use the degraded data to learn $\Delta B$, in cooperation with $K_s$ or $K_s$-based data. In the knowledge calibration-based FD methods, new knowledge is built for completing FD tasks, i.e., $K_t := \{A, B_t, C, D, L(w), L(v)\}$, $B_t = B - \Delta B$. (33) As we can see, $K_t$ is obtained through knowledge calibration, where DK, the part of $K_t$ that is not shared with $K_s$, is the changed knowledge to learn. The other components of the source domain-specific knowledge $K_s$, namely $CK = \{A, C, D, L(w), L(v)\}$, can be transferred and reused to obtain $K_t$.
Based on $K_t$, Fig. 10 presents the detection results for two sensor faults in traction systems. Based on the new normal regions after calibration, the knowledge calibration-based FD methods show satisfactory performance when traction systems suffer from actuator performance degradation.
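The calibration step can be sketched numerically under strong simplifying assumptions: a noise-free discrete-time state equation x(k+1) = A x(k) + B_t u(k), fully measured states, and hypothetical 2-by-2 matrices that are not the traction-system parameters of Table V. Only the changed input matrix is re-estimated from a short degraded record, while A is reused from the source domain.

```python
import numpy as np

# Hypothetical source-domain knowledge (not the traction-system values).
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.eye(2)
B_t_true = B - 0.02 * B                       # 2% actuator degradation

# Simulate a short degraded data record.
rng = np.random.default_rng(1)
n_steps = 20
U = rng.standard_normal((n_steps, 2))         # inputs u(k) as rows
X = np.zeros((n_steps + 1, 2))                # states x(k) as rows
for k in range(n_steps):
    X[k + 1] = A @ X[k] + B_t_true @ U[k]

# Reuse A (common knowledge) and learn only the input matrix:
# each row satisfies x(k+1) - A x(k) = B_t u(k), i.e. Z = U @ B_t.T
Z = X[1:] - X[:-1] @ A.T
B_t_hat = np.linalg.lstsq(U, Z, rcond=None)[0].T
delta_B_hat = B - B_t_hat                     # estimated changed part

K_t = {"A": A, "B": B_t_hat}                  # calibrated knowledge (subset shown)
print("estimated B_t:\n", np.round(B_t_hat, 4))
```

Because A is reused, the twenty samples are spent entirely on the four entries of the input matrix rather than on the whole model.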
3) Knowledge Compromise: For knowledge compromise-based FD approaches, the target-domain-specific knowledge (i.e., the knowledge in actuator performance degradation scenarios) will be where $K_s$ and $K_t$ are defined in (32) and (33), respectively. It indicates that the knowledge after compromise will be more tolerant of both the changed knowledge $\Delta B$ and faulty signals. As shown in Fig. 11, the compromised knowledge obtains new regions by tolerating DK, resulting in larger boundaries. When fault features fall within the new regions, they cannot be detected. The FD results under actuator performance degradation in Fig. 11 illustrate these conclusions: the knowledge compromise-based FD approaches are effective in detecting $f_2$ but fail to detect $f_1$.
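The effect of an enlarged normal region can be reproduced with a scalar toy example. The interval model below is an assumption made for illustration only: the calibrated region is re-centred on the degraded operating point, while the compromised region is taken as the smallest interval covering both the source and target regions. All numbers and function names are hypothetical.

```python
def calibrated_region(center_t, half_width):
    """Normal region re-centred on the target (degraded) operating point."""
    return (center_t - half_width, center_t + half_width)

def compromised_region(center_s, center_t, half_width):
    """Smallest interval covering both the source and target normal regions."""
    low = min(center_s, center_t) - half_width
    high = max(center_s, center_t) + half_width
    return (low, high)

def raises_alarm(y, region):
    """An alarm is raised when the measurement leaves the normal region."""
    low, high = region
    return not (low <= y <= high)

center_s, center_t, half_width = 1.00, 0.98, 0.01   # 2% drop of the operating point
cal = calibrated_region(center_t, half_width)
com = compromised_region(center_s, center_t, half_width)

y_small_fault, y_large_fault = 1.005, 1.05          # hypothetical faulty outputs
for name, y in (("small fault", y_small_fault), ("large fault", y_large_fault)):
    print(f"{name}: calibration alarm = {raises_alarm(y, cal)}, "
          f"compromise alarm = {raises_alarm(y, com)}")
```

In this toy setting, the small fault is flagged only by the calibrated region, whereas the large fault is flagged by both, which mirrors qualitatively the behavior reported for Fig. 11.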

VI. CONCLUSION
Beginning with an emphasis on the basics of FD techniques, this article has sketched two frameworks of transfer learning-motivated FD designs, reviewed some recent results, and detailed insights from a generalized viewpoint. The main focus has been on three principles that transfer learning-motivated FD methods should follow. Compared with the well-established traditional strategies, the transfer learning-motivated FD methods discussed in this survey article are still in their infancy. Due to their adaptive and self-adjusting capabilities, an explosive growth of transfer learning-motivated FD methods is expected, providing exceptional capabilities in processing the changing knowledge representation of automation systems. We apologize for any omission because new publications are continuously emerging. To this end, some remarks that deserve more attention are emphasized as follows.
1) Rethinking the essence behind transfer learning-motivated FD methods and keeping the three principles in mind can provide engineers with valuable suggestions for transferring knowledge intelligently and effectively.
2) In an online FD task for time-varying systems or systems in changing working environments, a bonus of transfer learning is its avoidance of recursive iterations, which are usually time-consuming.
3) Online transfer learning is functionally similar to plug and play because self-configuration and adjustment aim at a fraction of the knowledge, i.e., the changed part.
4) There should be sufficient data samples to identify the changed knowledge and parameters, yet too few to learn all knowledge of the target domain.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
It is worth noting that few-shot learning is also subject to the minimum condition that the number of samples suffices to identify the changed knowledge and parameters. This must be a precondition when transfer learning is adopted in FD tasks.

Fig. 3. Unified classification of the subject of transfer learning-motivated FD approaches.

Fig. 6. Fault isolation using directional residual vectors under the transfer learning framework.

Fig. 7. Potential applications of transfer learning-motivated FD for automation systems.

Fig. 9. System operations driven by source domain-specific knowledge, where the regions marked by red dotted lines represent normal operations in the source domain.

Fig. 10. FD results using knowledge calibration, where the changed knowledge is used for reformulating normal regions.

Fig. 11. FD results using knowledge compromise, where the changed knowledge is used for enlarging normal regions.

TABLE I SUMMARY OF NOTATIONS USED IN TRANSFER LEARNING

TABLE II BASIC COMPARISONS OF MULTIPLE INDICES BETWEEN THE TWO KINDS OF METHODS