Prognostics with Variational Auto-encoder by Generative Adversarial Learning

—Prognostics predicts the future performance progression and remaining useful life (RUL) of in-service systems based on historical and contemporary data. One of the challenges for prognostics is the development of methods capable of handling the real-world uncertainties that typically lead to inaccurate predictions. To alleviate the impact of uncertainties and to achieve accurate degradation-trajectory and RUL predictions, a novel sequence-to-sequence predictive model is proposed based on a variational auto-encoder (VAE) trained with a generative adversarial network (GAN). A long short-term memory (LSTM) network and a Gaussian mixture model are utilized as building blocks so that the model can provide probabilistic predictions. Correlative and monotonic metrics are applied to identify features sensitive to the degradation progress, in order to reduce the uncertainty induced by raw data. The extracted features are then concatenated with one-hot health-state indicators as training data, so that the model learns end-of-life (EoL) without the need for prior knowledge of failure thresholds. Performance of the proposed model is validated with health-monitoring data collected from real-world aero-engines, wind turbines, and lithium-ion batteries. The results demonstrate that significant performance improvements can be achieved in long-term degradation-progress and RUL prediction tasks.

Process safety, system reliability, and product quality are becoming increasingly essential in modern industry [1]. Machine condition monitoring (MCM), a maintenance strategy that involves the repair and replacement of damaged parts to reduce total life-cycle costs, is a vital part of many industries such as aerospace, energy, automotive, and heavy industry. Traditional strategies such as corrective (breakdown) and preventive (scheduled) maintenance are becoming less capable of meeting the increasing industrial demand for efficiency and reliability [2]. Prognostics and health management (PHM) is a novel paradigm that enables real-time health assessment and future condition prediction. PHM incorporates various disciplines (e.g., sensing technologies, signal processing, machine learning, and reliability analysis) and provides an intelligent MCM strategy to maintain a system's originally intended functions [3], or even to distinguish whether a local malfunction will affect the key performance indicators of the whole system [1].
Prognostics of in-service systems is a pillar of PHM that can be divided into two types: 1) remaining useful life (RUL) evaluation (i.e., event prediction); and 2) future degradation estimation (i.e., event-progression prediction). Data-driven approaches, which use information about current and previous usage conditions to identify the characteristics of the contemporary degradation state and to predict the future trajectory, have been regarded as a powerful solution for prognostics [4] and have achieved success in cyber-physical systems [5], [6]. Machine learning, as the most common data-driven technique, can act as a bridge connecting big machinery data and intelligent prognostics [5]. For example, a deep convolutional neural network has been proposed to map monitored feature data to machine health status [7]. The combination of neural networks and fuzzy systems has been employed successfully to capture more information for PHM [8]. Further, Elforjani et al. [9] employed three supervised machine-learning techniques: a support vector machine, a multi-layer neural network, and Gaussian process regression to estimate the RUL of slow-speed naturally degrading bearings using acoustic technology. Recently, a so-called "vanilla" long short-term memory (LSTM) network has been utilized in [10] to improve the accuracy of RUL prediction for complicated industrial processes.
However, the above data-driven prognostic systems have potential concerns, outlined as follows: 1) Feature representation: the data for prognostics are usually formulated sequentially, and the hidden features behind these sequences are vital for representing a system's health condition. However, handcrafted features may not perfectly represent the degradation throughout the lifetime. 2) Prior domain knowledge: RUL is calculated by subtracting a system's current cycle from its end-of-life (EoL). Generally, EoL is predicted: a) by mapping features to a piece-wise linear RUL curve [11] or using a linear relation between features and EoLs (e.g., Max.E-EoL [12]); or b) when the degradation level reaches one or more pre-defined thresholds [13] (e.g., E-trend [12]). Both methods require sufficient expert knowledge, either to build linear relations or to define thresholds that differ between scenarios, which consequently hinders their flexibility. 3) Multi-mode degradation: in reality, a system can degrade in different manners (i.e., a variety of degradation processes) even when it undergoes the same operation. Most prior and current prognostic models have been conducted on simple naturally degraded data with fixed initial parameters. It is therefore critical to make the predictive model adaptive to different degradation modes.
Despite the variety of machine-learning algorithms applied, it has been demonstrated in [14] that feature representation determines the upper-bound performance of models. In many cases, degradation data cannot be collected directly, and system feedback is used as an alternative. For instance, bearing vibration signals [15], [16] are commonly used for gearbox RUL prediction. It is difficult to extract effective features enriched with degradation characteristics directly from measurements, and conventional signal processing and feature extraction techniques often limit the ability to reveal intricate correlations [17]. Therefore, the instrumentation and feature extraction scheme should be carefully developed. A complete data-driven method has been proposed in [18] to produce system health indicators automatically, without prior knowledge of system monitoring or signal processing. A local feature-based gated recurrent unit network has been proposed in [19] to generate feature sequences without requiring high-level expert knowledge. RUL prediction is achieved by subtracting the current cycle from the predicted EoL. The most common method for predicting EoL involves auxiliary labeling of the training data, where each sample is required to be associated with its RUL label as a target. In this case, the piece-wise linear method [11] is usually adopted. This requires extra work and is generally very time-consuming. Moreover, if the available label information is limited, the advantage of machine learning can be minimal. To overcome this, generative adversarial network (GAN) based models are proposed in [20], [21] to cope with the insufficiency of health data for asset reliability prediction. Another method to predict EoL requires defining a failure threshold in advance; EoL is then assumed to occur when a health indicator exceeds that threshold.
For example, an appropriate threshold is required to separate the hyper-plane of the high dimensional features in support-vector machines. However, sufficient expert knowledge of critical components' failure threshold is not always readily available and human factors introduce much uncertainty, which is difficult to model and brings complexity to the analysis and synthesis procedures [5]. Furthermore, it is not appropriate to use a single threshold to summarise all failure modes.
The degradation processes of the same mechanical system can vary: it may undergo multi-mode degradation triggered by many causes, including enclosure problems, excessive operation, lack of maintenance, and corrosive environments. Mogren [22] suggests that generative adversarial networks (GAN) [23] are a viable way of modeling a distribution over different types of sequential data. Ha et al. [24] demonstrate that GMM-RNNs can learn to forecast from an immense number of observations across multiple scenarios.
Inspired by previous work, this paper proposes a novel data-driven approach based on GAN, focusing on enhancing the prediction of long-term degradation and remaining useful life without pre-defining specific component failure thresholds. The generator in the GAN is a novel adaptation of a sequence-to-sequence variational auto-encoder (VAE) derived from the combination of an LSTM and a Gaussian mixture model (GMM), and the discriminator is a bidirectional LSTM. The contributions and intellectual merit of this research are two-fold: 1) An LSTM-GMM-based VAE is proposed as a generative model, fed with time-series data combined with a one-hot health indicator to bypass the low-accuracy predictions caused by an imprecise pre-defined failure threshold. In this model, the concatenation of the GMM with LSTM networks allows modeling and predicting different degradation scenarios in a single neural network conditioned on previous records. 2) The proposed VAE model is trained through an adversarial learning approach, i.e., GAN, to enhance the accuracy and robustness of long-term degradation and RUL prediction. The model's effectiveness is quantified using health data collected from various real-world engineering systems, including aero-engines, wind turbines, and lithium-ion batteries.

The remainder of this paper is arranged as follows: Section II formulates the problem. Section III presents the methodology in detail. Section IV presents the experimental results. Finally, Section V contains concluding remarks.

II. PROBLEM STATEMENT
Let x^(i) denote a vector of multivariate sensor measurements such that x^(i) = [x^(i;1), ..., x^(i;m)], where m is the number of sensors. Formally, a sensor measurement sequence is described by X = {x^(0), ..., x^(n−1)}, where x^(i) ∈ R^m and n is the total number of observed time steps. The ground truth of the prediction at the next time step is denoted by Y = {y^(1), ..., y^(n)}, where y^(i) = x^(i), i = 1, ..., n. The dataset D is defined by D = {(x^(i−1), y^(i))}, i = 1, ..., n. In general, the data-driven prognostics approach learns the best predictor of run-to-failure degradation from previously observed data, i.e., a training dataset D_T. Then, based on the prediction from the trained model, the RUL can be estimated at each time step. This problem can be formulated as finding a non-linear mapping function F : x → z, with a latent variable z ∈ R^{N_z} and N_z < m. Subsequently, the optimal predictor can be formulated as a function of z:

    ŷ = f_γ(z),    (1)

where γ denotes the parameters of the nonlinear mapping f_γ that are optimized through training. The primary goal of this paper is to develop a data-driven approach to learn the non-linear mapping f_γ for degradation-progress modeling and stable RUL prediction.
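The construction of D above amounts to pairing each measurement with its successor, so that the predictor learns one-step-ahead degradation dynamics. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def build_dataset(X):
    """Build D = {(x^(i-1), y^(i))} from a sensor sequence X of shape (n, m).

    Each target y^(i) is simply the next measurement x^(i), so the predictor
    learns one-step-ahead degradation dynamics.
    """
    inputs = X[:-1]   # x^(0) ... x^(n-2)
    targets = X[1:]   # y^(1) ... y^(n-1), with y^(i) = x^(i)
    return list(zip(inputs, targets))

# Toy sequence: n = 6 time steps, m = 2 sensors.
X = np.arange(12, dtype=float).reshape(6, 2)
D = build_dataset(X)
```

With n time steps this yields n − 1 input/target pairs.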

A. Feature Extraction & Selection
As shown in Fig. 1, feature extraction and selection is an essential first-phase preparation of prognostics. In this process, critical features that contain sufficient degradation signatures will be identified in order to increase the efficiency and reliability of prognostics by: 1) reducing the cost of feature measurement; and 2) minimizing the dimensions of data required to describe the degradation progress [17], [25].
The majority of mechanical systems normally undergo gradual degradation rather than breaking down unexpectedly. Considering that ideal features for prognosis should indicate the system degradation trajectory throughout the lifetime, i.e., should be monotonous, Pearson's correlation [26] and monotonic metrics [27] are implemented to evaluate the degradation sensitivity of features f_t extracted from the initially sampled raw data X. Specifically, the monotonic metric in Equ. (2) evaluates the ascending/descending trend of a feature, and the Pearson's correlation metric in Equ. (3) assesses the correspondence between the feature and time:

    mono = | #{d(x^(t;m)) > 0} − #{d(x^(t;m)) < 0} | / (T − 1),    (2)

    cor = | Σ_{t=1}^{T} (x^(t;m) − x̄^m)(t − t̄) | / sqrt( Σ_{t=1}^{T} (x^(t;m) − x̄^m)² · Σ_{t=1}^{T} (t − t̄)² ),    (3)

where T is the sequence length, x^(t;m) is the m-th feature at time step t, x can be a sensor measurement (e.g., vibration) or a statistic (e.g., amplitude in the frequency spectrum), (·̄) denotes the mean operation, and d(·) denotes the differential operation. Feature selection is accomplished based on a linear combination of the correlative and monotonic scores, i.e., α · mono + (1 − α) · cor, where the trade-off hyper-parameter α is set to 0.5 here. The feature with the highest criterion value is selected, discarding irrelevant or redundant features.
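The combined selection criterion can be sketched as follows (a minimal illustration; the metrics follow the standard definitions of monotonicity and Pearson's correlation against time, and all names are ours):

```python
import numpy as np

def monotonicity(x):
    # Fraction of positive minus negative first differences (Equ. (2) style).
    d = np.diff(x)
    return abs(np.sum(d > 0) - np.sum(d < 0)) / (len(x) - 1)

def correlation(x):
    # Absolute Pearson correlation between the feature and time (Equ. (3) style).
    t = np.arange(len(x))
    return abs(np.corrcoef(x, t)[0, 1])

def select_feature(features, alpha=0.5):
    # features: dict of name -> 1-D array.
    # Returns the name with the highest combined criterion
    # alpha * mono + (1 - alpha) * cor.
    score = {k: alpha * monotonicity(v) + (1 - alpha) * correlation(v)
             for k, v in features.items()}
    return max(score, key=score.get)
```

A strictly increasing feature scores 1.0 on both metrics and wins over an oscillating one.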

B. Feature Transformation
Predominantly, to solve the nonlinear equation y_thold = F(t_EoL; θ) for RUL prediction, it is necessary to find the time cycle t_EoL at which the degradation reaches a certain level y_thold. The RUL can then be determined from RUL = t_EoL − t_current. Defining y_thold requires sufficient expert knowledge of the system. When the system is intricate and the failure modes are varied, it is unmanageable to define a single threshold that represents all failures.
To enable the model to learn t_EoL, the selected feature is concatenated with an additional health indicator (HI) to represent the machinery degradation process, instead of being used directly:

    S_t = (f_t, h_1, h_2),    (4)

where f_t ∈ R^k and k is the dimension of the selected features. Unlike the prognostic method in [28], which attempts to map HIs to RULs, here HI = (h_1, h_2) is generated by one-hot encoding with only two values: (1, 0) indicates that the equipment is healthy and currently in operation, while (0, 1) indicates that the equipment has failed and requires maintenance. The initial value is set to S_0 = (0, 1, 0). To develop a simple, robust approach that works well for a broad class of degradation series, the method first re-formats each series to a fixed length T_max, where T_max is the length of the longest sequence in the training dataset, representing the slowest degradation mode. In principle, T_max can be read as a variable directly from the training data. For a sequence S whose length T_s is shorter than T_max, S_t is set to (0, 0, 1) for T_s ≤ t < T_max so that all series share the same length T_max. In addition, min-max normalization and mean filtering are applied in advance to suppress the influence of noise to some degree. Model training is discussed in detail in the next section.
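The transformation and padding rule above can be sketched as follows (a minimal illustration assuming a one-dimensional feature, k = 1; names are ours):

```python
import numpy as np

def transform(f, t_max):
    """Concatenate a 1-D feature series f (length ts <= t_max) with a one-hot
    health indicator and pad to length t_max.

    Each row is S_t = (f_t, h1, h2): HI (1, 0) = healthy, (0, 1) = failed.
    Padding rows beyond the observed length T_s are set to (0, 0, 1).
    """
    ts = len(f)
    S = np.zeros((t_max, 3))
    S[:ts, 0] = f
    S[:ts, 1] = 1.0   # healthy while observations exist (t < T_s)
    S[ts:, 2] = 1.0   # rows with T_s <= t < T_max: S_t = (0, 0, 1)
    return S
```

All series therefore share the same length T_max, as required by the batched sequence model.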

C. LSTM-GMM based VAE with Adversarial Training
The long short-term memory (LSTM) neural network is a significant branch of recurrent neural networks (RNNs), which are often used to model sequences of data [29]. However, a standard LSTM generates sequences one data point at a time, which does not yield an explicit global sequence representation. Considering that our problem scenario is similar to those of [20], [30], the variational auto-encoder (VAE) is a good option, since it has been verified to be an efficient stochastic variational inference and learning algorithm that scales to large datasets. Moreover, the VAE even works in the intractable case under some mild differentiability conditions [30]. Here, an LSTM- and GMM-based sequence-to-sequence VAE with adversarial training, referred to as VAE-GAN, is proposed, seeking to incorporate distributed latent representations of entire sequences (life-long degradation) with various degradation modes. The adversaries, a generator G and a discriminator D, are two different deep recurrent neural networks.

Generator: As shown in Fig. 1, the encoder in G consists of two LSTMs, taking the input in both directions to obtain two hidden states. Specifically, the encoder in the generator takes all available observations S^F_{T_s} as well as the same observations in the opposite order S^B_{T_s} as inputs, and outputs two hidden states h^F_{T_s} and h^B_{T_s}:

    h^F_{T_s} = LSTM_F(S^F_{T_s}),   h^B_{T_s} = LSTM_B(S^B_{T_s}).    (5)

Then, a fully-connected layer maps the concatenated hidden state h = [h^F_{T_s}; h^B_{T_s}] into μ and σ̂. In addition, the exp operation is applied to σ̂ because a standard deviation must be non-negative:

    μ = W_μ h + b_μ,   σ = exp(σ̂),   σ̂ = W_σ h + b_σ,    (6)

and the latent vector z is then set up as follows [30]:

    z = μ + σ ⊙ ε,    (7)

where ε ∼ N(0, I) and ⊙ denotes the element-wise product. Therefore, the latent vector z is a learned vector of dimension N_z conditioned on the input sequence, rather than a deterministic output for a given input. Such an encoding scheme enables modeling under mildly different conditions [30]. The decoder in G is a unidirectional LSTM network.
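The encoder's sampling step, z = μ + σ ⊙ ε with ε ∼ N(0, I) as described above (the reparameterization trick), can be sketched in NumPy as follows (the fully-connected weights in `params` and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(h_f, h_b, n_z, params):
    """Map concatenated bidirectional hidden states to (mu, sigma) and sample
    z = mu + sigma * eps — the reparameterization trick, which keeps the
    sampling step differentiable with respect to mu and sigma.
    """
    h = np.concatenate([h_f, h_b])
    mu = params["W_mu"] @ h + params["b_mu"]
    # exp keeps the standard deviation strictly positive.
    sigma = np.exp(params["W_sigma"] @ h + params["b_sigma"])
    eps = rng.standard_normal(n_z)
    return mu + sigma * eps, mu, sigma
```

In training, gradients flow through mu and sigma while eps carries the stochasticity.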
At each time step t, the decoder takes the previous data point S_{t−1} and the latent vector z as a concatenated input. This input format implies that the generated hidden state is conditioned on the latent vector z sampled from the encoder, which is trained end-to-end with the decoder. The computation of the decoder can be described as follows:

    x_{t−1} = [S_{t−1}; z],    (8)

    [h_t; c_t] = LSTM(x_{t−1}, [h_{t−1}; c_{t−1}]),    (9)

where the inputs of the model are x_{t−1} and h_{t−1}, and the initial hidden state h_0 of the decoder is the output of a fully-connected layer, h_0 = tanh(W_z z + b_z). The output at each time step is the hidden state h_t, which parameterizes a probability distribution of the subsequent data point S_t.
A fully-connected layer is used to project the hidden state h_t into the output y_t ∈ R^{3M+2}, which can be split into M mixed Gaussian distributions describing f_t and one categorical distribution (p_1, p_2) describing the health indicator HI:

    y_t = [(π̂_1, μ_1, σ̂_1), ..., (π̂_M, μ_M, σ̂_M), (p̂_1, p̂_2)].    (10)

The feature f_t in S_t, described by a Gaussian mixture model with M normal distributions at each time step, is given as:

    p(f_t) = Σ_{i=1}^{M} π_i N(f_t | μ_i, σ_i),    (11)

where μ_i and σ_i are the mean and standard deviation of the i-th univariate normal distribution, respectively; π is a categorical distribution of length M with Σ_{i=1}^{M} π_i = 1, representing the mixture weights of the Gaussian mixture model.
Due to the probability constraints and the non-negativity of standard deviations, exp and softmax operations are employed, treating the raw outputs as logit values:

    π_i = exp(π̂_i) / Σ_{j=1}^{M} exp(π̂_j),    (12)

    σ_i = exp(σ̂_i),    (13)

    (p_1, p_2) = softmax(p̂_1, p̂_2).    (14)

Discriminator: The discriminator D is a bidirectional LSTM network that takes the input time series into account in both directions [22]. The outputs of each LSTM cell in D are fed into a fully-connected layer with weights shared across time steps. The per-cell sigmoid outputs are then averaged into the final decision for the sequence.
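The output head described above — splitting y_t ∈ R^{3M+2} into mixture parameters and HI probabilities, then sampling the next data point — can be sketched as follows (projection weights omitted; names are ours):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # shift for numerical stability
    return e / e.sum()

def split_and_sample(y, M, rng):
    """Split a raw output vector y in R^(3M+2) into GMM parameters
    (pi, mu, sigma) for the feature and a categorical (p1, p2) for the HI,
    then sample the next data point from those distributions.
    """
    pi = softmax(y[:M])               # mixture weights, sum to 1
    mu = y[M:2 * M]
    sigma = np.exp(y[2 * M:3 * M])    # exp enforces sigma > 0
    p_hi = softmax(y[3 * M:])         # HI logits -> probabilities
    i = rng.choice(M, p=pi)           # pick a mixture component
    f_next = rng.normal(mu[i], sigma[i])
    hi_next = rng.choice(2, p=p_hi)   # 0 -> (1, 0) healthy, 1 -> (0, 1) failed
    return f_next, hi_next, pi, sigma
```

During prediction this sampled point is fed back as the next input to the decoder.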
Loss & Training: The model is trained by simultaneously updating the discriminative distribution so that it discriminates between samples from the data-generating distribution p_data(S) and those from the generative distribution p_g(Ŝ) [23]. First, the optimization of the discriminator D given the generator G is described. Similar to the training of sigmoid-based classifiers, it involves minimizing the cross entropy. The discriminator loss function L_D is formulated as:

    L_D = L_D^GAN(θ_d, θ_g) + L_D^2(θ_d),    (15)

where L_D^GAN(θ_d, θ_g) is the standard GAN loss and L_D^2(θ_d) is the standard L2 regularization:

    L_D^GAN = −E_{S∼p_data}[log D(S_t)] − E_{Ŝ∼p_g}[log(1 − D(Ŝ_t))],    (16)

    L_D^2 = λ ||θ_d||²_2,    (17)

where S_t is sampled from the ground-truth degradation data and G(x_t) = Ŝ_t is the corresponding generated sample. Then, D is fixed and G is optimized to minimize the discrimination accuracy of D. The reconstruction loss in Equ. (18) is the sum of four terms: the standard GAN loss L_G^GAN in Equ. (19), the log loss L_G^f of the feature variation in Equ. (20), the log loss L_G^h of the health-state terms HI in Equ. (21), and the Kullback-Leibler divergence loss L_G^k in Equ. (22), representing the difference between the distribution of the latent vector z and a Gaussian distribution with zero mean and unit variance:

    L_G = L_G^GAN + L_G^f + L_G^h + L_G^k,    (18)

    L_G^GAN = −E_{Ŝ∼p_g}[log D(Ŝ_t)],    (19)

    L_G^f = −(1/T_max) Σ_{t=1}^{T_s} log( Σ_{i=1}^{M} π_i N(f_t | μ_i, σ_i) ),    (20)

    L_G^h = −(1/T_max) Σ_{t=1}^{T_max} (h_1 log p_1 + h_2 log p_2),    (21)

    L_G^k = −(1/(2 N_z)) Σ_{j=1}^{N_z} (1 + log σ_j² − μ_j² − σ_j²).    (22)

Note that the Gaussian mixture model parameters modeling f_t beyond T_s are discarded when calculating L_G^f, while L_G^h is calculated using all of the categorical distribution parameters modeling the health indicator HI up to T_max. Both terms are normalized by the total sequence length T_max. This definition of L_G^h + L_G^f was found to be more robust and enables the VAE to learn the EoL directly. We empirically update the parameters of D k times and then update G once. The optimization process alternates between G and D, gradually improving the performance of both.
The global optimum is achieved when p_data = p_g: once the discrimination ability of D has improved to its limit and it can no longer discriminate correctly, G is considered to have captured the distribution of the real data.
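The alternating update schedule described above (k discriminator steps per generator step) can be sketched as follows; the update callbacks stand in for actual SGD steps on L_D and L_G, and all names are ours:

```python
def train(num_steps, k, update_d, update_g):
    """Alternating adversarial updates: the discriminator D is updated k
    times for every single generator G update (k is chosen empirically).
    `update_d` / `update_g` are stand-ins for one SGD step on L_D / L_G.
    Returns the update schedule for inspection.
    """
    history = []
    for step in range(num_steps):
        for _ in range(k):
            update_d()       # improve D while G is fixed
            history.append("D")
        update_g()           # then one step on G while D is fixed
        history.append("G")
    return history

# Dummy callbacks just to show the schedule.
log = train(num_steps=2, k=3, update_d=lambda: None, update_g=lambda: None)
```

Updating D more often keeps its decision boundary informative, so the gradient signal reaching G stays useful.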

D. Prognostics
Degradation Prediction: After sufficient off-line training, Algorithm 1 shows the pseudo-code used to make a present-to-EoL prediction at time step t_p, given previous data collected at 0 < t < t_p. At time t, the generator takes the transformed feature data S_t as input and outputs y_t as the parameters of a probability distribution over the data point S_{t+1}. During the prediction process, Ŝ_t is sampled from the Gaussian mixture model and categorical distributions at time step t. Unlike during training, the predicted Ŝ_t is fed into the next time step t + 1. The prediction process continues until HI = (0, 1) is reached.

RUL Prediction: The RUL prediction builds on the present-to-EoL prediction. Starting from the prediction time t_p, the algorithm continues predicting until the health-state indicator HI = (0, 1) is obtained at time step t_EoL. The predicted remaining useful life is then defined as:

    RUL(t_p) = t_EoL − t_p,    (23)

where t_p = 1, 2, ..., t_EoL, so the RUL prediction can be carried out at each time step.
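The rollout in Algorithm 1 can be sketched as follows (a minimal illustration; `step_fn` stands in for the trained generator's one-step sampler, and the toy step function in the usage example is hypothetical):

```python
def predict_rul(s_tp, step_fn, t_p, t_max):
    """Present-to-EoL rollout: starting from the state at time t_p, feed each
    predicted S_t back into the model until the health indicator flips to
    (0, 1); then RUL = t_EoL - t_p. States are (f, h1, h2) tuples.
    """
    s_t, t = s_tp, t_p
    while t < t_max:
        s_t = step_fn(s_t)            # sample S_{t+1} from the output distribution
        t += 1
        if s_t[1:] == (0.0, 1.0):     # failure indicator reached -> t_EoL = t
            break
    return t - t_p                    # RUL = t_EoL - t_p
```

For example, with a toy deterministic step that decays the feature by 0.25 per cycle and flips the HI at zero, a run starting at f = 1.0 reaches EoL after 4 cycles.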

A. Experimental Setup
Dataset Description: In this paper, three types of data (aero-engine, wind turbine, and lithium-ion battery) are employed to verify the effectiveness and flexibility of the proposed method in different industrial applications. 1) MAPSS: The aero-engine data, provided by the Modular Aero-Propulsion System Simulation (MAPSS), consist of multiple multivariate run-to-failure recordings (24 sensors and 3 operational settings; see [31] for details) from a fleet of aero-engines with dissimilar levels of initial wear and unspecified manufacturing disparity. Sensors with constant records are eliminated, as they contribute no useful information. To select the most representative degradation feature (i.e., sensor), all features are assessed by the correlative and monotonic metrics. As shown in Fig. 2, the 11th (static pressure at HPC outlet), 12th (ratio of fuel flow to Ps30), and 13th (corrected fan speed) features produce similarly high criterion values. For simplicity, the 11th feature is selected.
2) HSSB: The wind turbine generator (WTG) vibration data, provided by a Green Power Monitoring System, were collected over 50 days of operation of real-world 32222-J2-SKF tapered high-speed shaft bearings (HSSB) installed in a 2.2 MW WTG with a typical shaft speed of 30 Hz, ending with an inner-race fault. By assessing the typical time-series statistical features (mean, standard deviation, skewness, kurtosis, and energy), kurtosis was found to be the most representative degradation feature, as shown in Fig. 2. Visually, it exhibits a slowly growing trend at the early stage of decay and accelerated growth at the end.
3) Lithium-ion: The dataset [32] consists of 124 commercial lithium-ion batteries cycled to failure under various fast-charging conditions. These lithium iron phosphate (LFP)/graphite cells (1.1 Ah, 3.3 V), manufactured by A123 Systems, were cycled in horizontal cylindrical fixtures on a 48-channel Arbin LBT potentiostat in a 30 °C forced-convection temperature chamber. To capture the electrochemical evolution of individual cells during cycling, the cycle-to-cycle evolution of the discharge voltage curve is considered the most representative degradation feature according to Fig. 2.
In each case, every selected feature series is down-sampled to a fixed sequence length T_s at intervals of n = ceil(T / T_max) (where ceil(·) returns the ceiling value) to form a training dataset. At each time step t < T_s, the health indicator HI = (1, 0) is concatenated to f_t, while HI = (0, 1) at T_s ≤ t ≤ T_max. To facilitate computational efficiency, T_max is regarded as a hyper-parameter here and manually set to 100.

Model Layout Details: The LSTM network in generator G consists of 256 internal (hidden) units. The number of components of the Gaussian mixture model is set to M = 10. Discriminator D has a bidirectional layout, while G is unidirectional.

Baseline Models: Two distinct baseline models are employed for a comparative study. The first, obtained by removing the encoder, is a pure decoder LSTM: an auto-regressive model without latent variables, trained entirely with the loss function L = L_G^f + L_G^h to predict the next status at each step of the recurrence. The second is the VAE (G) alone (without adversarial training), with loss function L = L_G^f + L_G^h + L_G^k.

Implementation: Back-propagation through time (BPTT) and mini-batch stochastic gradient descent (SGD) were used [23], with the batch size set to 10. The model was pre-trained for 10 epochs with loss function L_G = L_G^f + L_G^h + L_G^k. Layer normalization and recurrent dropout with a keep probability of 90% were applied. The learning rate was set to 0.001 and gradient clipping of 1.0 was used. Five-fold cross-validation was employed for parameter tuning. The implementation was built on the TensorFlow platform with NVIDIA GeForce GTX 1080 Ti and Titan Xp GPUs with 32 GB memory.

Evaluation: The models are compared using standard prognostics metrics: Prognostic Horizon (PH), α−λ Accuracy, Relative Accuracy (RA), and Convergence.
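The down-sampling rule n = ceil(T / T_max) above can be sketched as follows (a minimal illustration; names are ours):

```python
import math

def resample(series, t_max):
    """Down-sample a run-to-failure series to at most t_max points by taking
    every n-th value, with n = ceil(T / t_max). Series shorter than t_max
    are returned unchanged (n = 1); padding to t_max happens separately."""
    n = math.ceil(len(series) / t_max)
    return series[::n]

# A 250-cycle series with t_max = 100 keeps every 3rd point.
s = resample(list(range(250)), 100)
```

The result has length T_s ≤ T_max, and the health-indicator padding then brings every series to exactly T_max.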
PH is defined as the difference between the EoL and the first time at which the prediction continuously resides within the accuracy zone, a constant bound of magnitude α with respect to the true EoL. The α − λ accuracy determines whether a prediction falls within specified limits (±α of the actual RUL) at a specific cycle t_λ, expressed with a fraction λ between the starting cycle of RUL prediction t_p (λ = 0) and EoL (λ = 1):

    (1 − α) · RUL(t_λ) ≤ R̂UL(t_λ) ≤ (1 + α) · RUL(t_λ),   t_λ = t_p + λ (t_EoL − t_p).    (24)

RA is the relative error rate between the true and predicted RUL at t_λ, shown in Equ. (25):

    RA = 1 − |RUL(t_λ) − R̂UL(t_λ)| / RUL(t_λ).    (25)
Convergence (CoM, Equ. (26)) is defined as the Euclidean distance between (t_p, 0) and the centroid (t_C, E_C) of the area under the relative error curve E(k) between t_p and EoL:

    CoM = sqrt( (t_C − t_p)² + E_C² ).    (26)
The performance of degradation prediction is evaluated using the mean absolute error (MAE) of the generated output:

    MAE = (1/N) Σ_{n=1}^{N} (1/T_s) Σ_{t=1}^{T_s} | Ŝ_t − S_t |,    (27)

where T_s is the length of a series, Ŝ_t is the predicted value, S_t is the ground truth, and N is the number of series.
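The simpler metrics (MAE for a single series, the α−λ check, and RA) can be sketched directly from the definitions above (a minimal illustration; names are ours):

```python
def mae_decay(pred, truth):
    # Mean absolute error between a predicted and a true degradation series.
    assert len(pred) == len(truth)
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)

def alpha_lambda(rul_true, rul_pred, alpha=0.1):
    # True if the prediction falls within +/- alpha of the actual RUL.
    return (1 - alpha) * rul_true <= rul_pred <= (1 + alpha) * rul_true

def relative_accuracy(rul_true, rul_pred):
    # RA = 1 - |r - r_hat| / r (higher is better, 1 is a perfect prediction).
    return 1 - abs(rul_true - rul_pred) / rul_true
```

For instance, a predicted RUL of 95 against a true RUL of 100 passes the α = 0.1 check, while 80 does not.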

B. Results and Discussion
The present-to-EoL prediction results for MAPSS, HSSB, and Lithium-ion generated by the baselines (decoder LSTM, VAE) and the proposed VAE-GAN model are shown by column in Fig. 3. In each row, one typical test series for each case is visualized at every 10 time steps as an example. The MAE decay for all test series is listed in TABLE I. The starting cycle of prediction t_p is 20, since the indicator remains stable at the beginning. By utilizing the decoder LSTM as a standalone predictive model, the degradation prediction can be conditioned on the previous points. Specifically, the decoder LSTM is first employed to 'encode' the observations into a hidden state h. Afterwards, h is used as the initial hidden state to yield the remaining degradation prediction. The degradation curves predicted by the decoder LSTM roughly represent the real trend, and the predicted degradation distribution approaches the ground truth in the second half of the lifetime. As more observations become available, the result becomes more accurate. According to TABLE I, the VAE produces more accurate predictions than the decoder LSTM, as its MAE decay converges faster to a given level (e.g., MAE decay < 0.02 in MAPSS) after time step 50. Furthermore, the smaller MAE decay of the VAE in the first half of the lifetime indicates that the VAE generates better degradation predictions than the decoder LSTM even at early times. The main reason is that the prediction is conditioned not only on the previous observations but also on the latent vector z encoded by the bidirectional LSTM. As TABLE I shows, VAE-GAN outperforms the baseline models at the early stage, with MAE decay within 0.05 (e.g., MAPSS), which indicates that adversarial training helps the VAE better capture the distribution of the real data. Fig. 4 depicts the RUL predictions of three typical series with the related RUL ground truth as reference, i.e., series #1 (Lithium-ion), series #2 (MAPSS), and series #3 (HSSB).
The RUL prediction is visualized at every 5 time steps. As the prognostic horizon performance in Fig. 4 illustrates, a longer PH implies more time to take corrective action based on a prediction with some credibility. The allowable error bound α with respect to the EoL ground truth is set to 0.05. Evidently, the PH of the VAE is wider than that of the decoder LSTM, and the VAE-GAN extends the PH further. Both VAE and VAE-GAN provide a sufficient PH to allow for corrective maintenance.
The RUL prediction quality evaluated by the α − λ metric is illustrated in Fig. 4, which labels predictions either "True" or "False" by verifying whether the prediction falls within the α (α = 0.1) accuracy zone when made early (λ = 0.25) or halfway (λ = 0.5) between the first prediction and EoL. This is a more stringent requirement than staying within a converging cone of error margins as a system nears EoL. Since t_λ may not coincide with the prediction step frequency, the t'_λ closest to t_λ is chosen. It can be observed that VAE and VAE-GAN predict the RUL more precisely than the decoder LSTM in the early and medium term.
As highlighted by the RULs provided in Fig. 4 and TABLE I, the predictions of all three models converge to the true RULs, which validates the assumption that prognostic performance improves as more information becomes available over time. The RA metric in Equ. (25) is then employed to quantify the accuracy levels. The relative error of the decoder LSTM is higher and fluctuates more heavily than that of the VAE. The VAE-GAN further lowers the relative error and flattens the fluctuation, proving to be a more accurate and stable predictive model. Since RA outputs error information at a specific time step, Cumulative Relative Accuracy (CRA) is used to produce an aggregate accuracy level for assessing the general error of the models. The average RA at t_λ and the CRA of all test series are presented in TABLE II. Row 3 of Fig. 4 presents the convergence metric, indicating the rate at which the relative accuracy improves with time. As stated earlier, convergence is the Euclidean distance between (t_p, 0) and the centroid of the area under the RA curve from t_p to the End-of-Useful-Predictions (EoUP). EoUP is introduced to express the minimum acceptable PH required to take maintenance action. From the industrial perspective, any prediction made beyond EoUP is of little or no use, since it does not leave enough time to carry out corrective measures. Given that a lower distance implies faster convergence, Fig. 4 and TABLE II show that, compared to the decoder LSTM, VAE and VAE-GAN produce reliable predictions at earlier stages. Moreover, the convergence of the VAE after adversarial training is slightly improved.

1) Multivariate Time Series:
Theoretically and practically, the proposed model can easily be extended to handle multi-feature input. Since the outputs at each time step are the parameters of a probability distribution over the next data point S_t, a multivariate normal distribution should be employed in that case. For example, for a two-feature input series, a modified probability distribution (compared with Equ. (11)) that considers the correlation between the two features (instead of assuming i.i.d.) is adopted:

    p(f_t) = Σ_{i=1}^{M} π_i N(f_t^1, f_t^2 | μ_i^1, μ_i^2, σ_i^1, σ_i^2, ρ_i^{12}),    (28)

where N is the probability density function of a bivariate normal distribution and ρ_i^{12} is the correlation parameter of each bivariate normal distribution.
Accordingly, the output $y_t$ (Equ. (10)) is modified as
$$y_t=\left\{\pi_i,\mu^1_i,\mu^2_i,\sigma^1_i,\sigma^2_i,\rho^{12}_i\right\}_{i=1}^{M}\cup\{p_1,p_2\},$$
where $y_t\in\mathbb{R}^{6M+2}$ can be split into $M$ mixed Gaussian distributions describing $f_t=(f^1_t,f^2_t)$ and one categorical distribution $(p_1,p_2)$ describing the health indicator HI. In addition to the $\exp$ and $\mathrm{softmax}$ operations (Equ. (12) to Equ. (14)), a $\tanh$ operation is applied to $\rho$ to ensure $\rho\in(-1,1)$.
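A sketch of this parameter mapping is given below. The memory layout of the raw output vector is an assumption made for illustration (the paper does not specify an ordering); only the transformations themselves follow the text:

```python
import numpy as np

def split_output(y, M):
    """Map a raw decoder output y in R^{6M+2} to valid distribution
    parameters: softmax for the mixture weights and the health-indicator
    categorical, exp for the scales, tanh for the correlations.
    Assumed layout: [pi | mu1 | mu2 | s1 | s2 | rho | p]."""
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()
    pi  = softmax(y[0:M])            # mixture weights, sum to 1
    mu1 = y[M:2*M]                   # means are unconstrained
    mu2 = y[2*M:3*M]
    s1  = np.exp(y[3*M:4*M])         # sigma > 0
    s2  = np.exp(y[4*M:5*M])
    rho = np.tanh(y[5*M:6*M])        # rho in (-1, 1)
    p   = softmax(y[6*M:6*M+2])      # one-hot health indicator probs
    return pi, mu1, mu2, s1, s2, rho, p
```

The parameter count checks out: $6$ values per component plus $2$ categorical logits gives the stated $6M+2$.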
Identically, a trivariate normal distribution (with $y_t\in\mathbb{R}^{10M+2}$) should be considered for a three-feature input series ($f_t\in\mathbb{R}^3$), and an $n$-variate normal distribution (with $y_t\in\mathbb{R}^{(1+2n+\binom{n}{2})M+2}$) for an $n$-dimensional feature ($f_t\in\mathbb{R}^n$). The general formula for the $n$-dimensional normal density is
$$\mathcal{N}(f;\mu,K)=\frac{1}{(2\pi)^{n/2}|K|^{1/2}}\exp\!\left(-\frac{1}{2}(f-\mu)^{\mathsf T}K^{-1}(f-\mu)\right),$$
where $f=(f^1,\cdots,f^n)$, $\mu=\mathrm{E}(f)$, and $K$ is the covariance matrix.
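The general $n$-variate density, with the covariance matrix $K$ assembled from the $n$ per-dimension scales and the $\binom{n}{2}$ pairwise correlations counted in the parameter budget above, can be sketched as follows (illustrative helper names, not the paper's code):

```python
import numpy as np

def mvn_pdf(f, mu, K):
    """General n-variate normal density:
    (2*pi)^(-n/2) |K|^(-1/2) exp(-0.5 (f-mu)^T K^{-1} (f-mu))."""
    n = f.shape[0]
    d = f - mu
    quad = d @ np.linalg.solve(K, d)           # avoids explicit inverse
    return np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi)**n * np.linalg.det(K))

def cov_from_params(sigma, rho):
    """Build K from per-dimension scales sigma (length n) and the
    C(n,2) pairwise correlations rho (upper triangle, row-major)."""
    sigma = np.asarray(sigma, float)
    n = len(sigma)
    K = np.diag(sigma**2)
    iu = np.triu_indices(n, k=1)
    for (i, j), r in zip(zip(*iu), rho):
        K[i, j] = K[j, i] = r * sigma[i] * sigma[j]
    return K
```

For $n=2$ this reduces to the bivariate case, and the per-component parameter count $1+2n+\binom{n}{2}$ matches one weight, $n$ means, $n$ scales, and $\binom{n}{2}$ correlations.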
To further explore the relation between prognostic accuracy and feature dimension, the proposed model is extended to match feature series with 2 or 3 dimensions, where the top $n$ most representative features are selected based on Section III-A. Features 1, 2, and 3 represent the 11th (static pressure at HPC outlet), 12th (ratio of fuel flow to Ps30), and 13th (corrected fan speed) features, respectively. The comparative results are presented in Fig. 5 and TABLE III. It can be observed that there is a slight increase in performance when $f_t\in\mathbb{R}^2$. However, when the feature dimension reaches 3, the prognostic performance deteriorates dramatically. This is because the added feature 2 is highly correlated with feature 1 and offers little useful information for training (a situation also known as feature redundancy). Additionally, using multiple inputs significantly increases the computational burden and thus makes it hard to achieve a Nash equilibrium during the training process. When the feature dimension exceeds 3, the training process becomes unstable and convergence cannot be guaranteed. One of the reasons is that each feature dimension has a different distribution even under the same degradation mode, so it is hard to precisely model all the different distributions at the same time. In future work, multivariate adaptive prognostic models based on generative adversarial learning will be further studied.
2) Threshold: In most prognostic methods, the EoL is reached when an indicator (e.g., a selected feature) exceeds a predefined threshold. However, this introduces additional concerns. Taking the CMAPSS data as an example: if the averaged feature value (11th feature) at failure time is chosen as the threshold across 100 degradation cases, there is a 50% probability that the system breaks down before ever reaching that threshold. If the minimum feature value at failure time is chosen instead, a systematic error (±13.61) is introduced into the RUL prediction even if the degradation prediction is 100% accurate. In this paper, the feature transformation method, which incorporates a one-hot health indicator, enables the model to learn different EoLs and thus bypasses the low-accuracy predictions produced by an imprecise predefined failure threshold.
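The 50% argument can be illustrated on synthetic failure-time values. The data below are a hypothetical stand-in for the per-unit feature values at failure (not the real CMAPSS 11th feature), chosen only to show the mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical feature values at failure time for 100 run-to-failure units;
# loc/scale are arbitrary illustration values.
end_values = rng.normal(loc=554.0, scale=0.5, size=100)

mean_thr = end_values.mean()   # averaged failure-time value as threshold
min_thr = end_values.min()     # minimum failure-time value as threshold

# Units whose feature never reaches the averaged threshold before failure:
# roughly half, for symmetric scatter around the mean.
frac_missed = np.mean(end_values < mean_thr)

# The minimum threshold is reached by every unit, but it is crossed early,
# shifting every predicted EoL by a systematic RUL bias.
```

The choice is thus a trade-off between missed detections (mean threshold) and a systematic early bias (minimum threshold), which is exactly what the learned one-hot health indicator avoids.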
3) Generalization: Similar to most classical machine-learning-based prognostic models, the proposed method needs sufficient run-to-failure historical data to achieve a significant performance level. Although the Gaussian components in the proposed model make it possible to separately model different stochastic events and scenarios governed by different rules [33], it is unable to produce accurate predictions for incoming data that deviate substantially from the learning datasets (i.e., data that follow a new degradation mode that has not been observed before). To tackle this problem, concepts such as physics-informed [34] or domain adaptation [35] machine learning models are encouraged.

V. CONCLUSIONS
This paper proposed a novel sequence-to-sequence variational auto-encoder with an adversarial learning approach for the prediction of long-term degradation progress and remaining useful life without defining a specific failure threshold. The approach uses correlative and monotonic metrics to identify the critical features related to degradation, which are then concatenated with health indicator vectors to train the model. The VAE consists of a bidirectional-LSTM-based encoder and an auto-regressive LSTM-GMM-based decoder, fully utilizing the capability of LSTMs to learn long-term dependencies in time series data. The output of the LSTM in the decoder is connected to a fully-connected layer that maps it to the parameters of a Gaussian mixture model and a categorical distribution, from which subsequent predictions are sampled. Prediction is conditioned on the encoded previous observations, which enables multi-mode degradation prediction, and the adversarial training helps the VAE better capture the distribution of real degradation progress, leading to more accurate RUL prediction. Experiments on real-world health monitoring data of aircraft turbofan engines, wind turbines, and lithium-ion batteries verified the effectiveness and robustness of the proposed approach.