Developing a Data-Driven Unsupervised Pattern Recognition Approach for Sensor Signal Anomaly Detection

—Coming up with a system for early detection of machine damages and failures is one of the important challenges in the industrial maintenance procedure to avoid additional costs and downtimes. To approach this goal, this paper uses the signal gathered by a sensing system which employed a spintropic sensor to measure the magnetic ﬁeld around the machine which somehow shows the machine’s behaviour. Using this signal and focusing on analysing and processing the signal, this paper develops a data-driven method to recognize signal patterns and subsequently detects anomalies. A challenging task that we succeeded to overcome in this paper is recognizing relevant signal patterns without having any prior knowledge. An algorithm designed for this task is therefore completely unsupervised which makes it consistent and suitable to apply it for the signals gathered for other types of machines. Using both frequency and time domain information, the proposed algorithm, which utilizes signal processing and machine learning techniques, is able to efﬁciently identify relevant signal patterns. Clustering results on the real data gathered by the aforementioned sensor have shown the high accuracy of 99 . 38 % in recognizing patterns. Furthermore, an anomaly score measure is used and according to its distribution, anomalies are detected appropriately.

There are several anomaly detection techniques which are model-based [24], [25]. It means prior physical knowledge for modelling specific types of anomalies are required in modelbased methods. Machine learning techniques lead us to be able to identify anomalies with a limited requirement of prior Ensieh Iranmehr, Ricardo Ferreira, Tim Böhnert, and Paulo P. Freitas are with International Iberian Nanotechnology Laboratory (INL) 4715-330 Braga, Portugal (e-mail: ensieh.iranmehr@inl.int, ricardo.ferreira@inl.int, tim.bohnert@inl.int, Paulo.Freitas@inl.int) Fig. 1: Proof of concept demonstration for anomaly detection. Noninvasive tools provide the data (signal) representing the machine's behaviour. The signal is then fed to an intelligent system to extract the relevant patterns and anomalies. physical knowledge. These are called data-driven methods [8], [6], [26] which are principally rely on the information gathered by data acquisition system.
Data-driven methods can be categorized into three main categories: 1) supervised [14], [1], [8], [27], 2) semi-supervised [28], and 3) unsupervised [6], [26], [3], [23] anomaly detection. Since the labelled data, which is required for supervised anomaly detection, is often not available or in other words, collecting sufficient anomolous samples is infeasible in most of the cases, many researchers have tried to detect anomalies without labelled data. However, they were forced to take advantage of supervised learning along with unsupervised for other steps of their algorithms such as pattern recognition to achieve better diagnosis [29], [30].
Detecting faults in machines is a hot topic nowadays due to this fact that anomalies might cause a significant degradation in machine's performance leads to wasting huge amount of money and time. Although several valuable works have been published in the field of detecting specific faults in some machines [11], [12], [13], [14], [9], it still seems a long way ahead to reach an intelligent system which is able to early detect anomalies automatically without having any knowledge about the machines. In this paper, anomaly does not mean bad behaviour, but rather rare/new/unexpected behaviour, which if not detected may lead to bad behaviour of the machine. Hence, this paper tries to get closer to this aim by designing a scheme for machine early anomaly detection using machine learning and signal processing techniques. Fig. 1 demonstrates the proof of concept for non-invasive machine early anomaly detection. A spintronic sensor based on tunnel magnetoresistance (TMR) [36], [37] is placed nearby a machine which measures a changing magnetic field around that machine. This sensor was previously used in some other proof of concepts [38], [39], [40] such as real-time interaction system for eye movement gesture control [38] which successfully classified eye movements. In this work, the TMR sensor's output signal which somehow shows the machine's behaviour is then fed into an intelligent system in order to detect signal patterns and subsequently anomalies. The latter forms the basis of present study in which an algorithm is proposed that can recognize patterns and anomalies efficiently.
In this paper, it is assumed that we do not have any prior physical knowledge about the machine. Hence, using machine learning techniques, we want to get as much information as it is possible from the TMR sensor's output signal by developing an unsupervised data-driven approach. Using this data, we want to detect abnormal behaviour of the machine before damage, failure or breakdown. Hence, the important task that we are going to do in this work is to recognise signal patterns in the input data unsupervisedly. Each signal pattern represents a state of the machine. Thus, by recognizing the patterns, we are able to understand the machine's behaviour. Deviation from average of signal patterns, informs that something unusual has happened which leads us to predict the failure or damage before it happens which is a very essential and challenging step in maintenance.
There are many supervised and unsupervised pattern recognition techniques including different types of neural networks and machine learning techniques. Depending on the data type and the problem to be solved, different algorithms have been developed so far. Among them, feed-forward [41], [42], recurrent [43] and reservoir neural networks [44], [45], [46] should be mentioned as powerful tools to handle labelled data. If we encounter a problem that data labels are not available, we can benefit of some other techniques such as dimension reduction [47], clustering [48], [49] to recognize patterns. The latter along with some signal processing techniques was used in our proposed algorithm since a completely unsupervised solution for the problem is our desire. Using patterns extracted from the input data, this paper used one of the simplest statistical technique (Z-Score) to compute anomaly score. Besides, another method was used to find rare pattern sequences which were then considered as anomalies.
Concisely, the main contribution of this paper is: 1) Developing a data-driven completely unsupervised algorithm for recognizing patterns; 2) Using hybrid time-frequency domain information in the proposed pattern recognition algorithm; 3) Utilizing some statistical techniques to find anomalies; 4) Conducting a proof-of-concept validation on real data gathered by the aforementioned sensor.
The paper is organized as follows. Section II focuses on the intelligent system designed for patterns recognition and anomaly detection. In this section, we explain in detail the whole algorithm analysing hybrid time-frequency domain information using signal processing and machine learning techniques and also explains the algorithm used for finding anomalous using some statistical techniques. Section III includes the results obtained from every step to achieve the goal. Moreover, performance of the proposed algorithm relying on several evaluation parameters is studied. We also discuss advantages of the present study and express some future works. Finally, we conclude the paper in section IV.

II. PROPOSED SCHEME FOR PATTERN RECOGNITION AND ANOMALY DETECTION
This section explains the scheme proposed for pattern recognition and anomaly detection. An overview of the whole algorithm designed for this work was demonstrated in Fig. 2. In this figure, the main steps of the proposed algorithm, the input/output of each step and the methods used for every step was briefly shown. As it can be seen in Fig. 2, the proposed scheme consists of 4 main steps: 1) feature extraction; 2) event extraction; 3) signal pattern recognition; and 4) anomaly detection. By combining different methods and concepts, we were able to provide a completely unsupervised data-driven system for signal pattern recognition. Then, using the recognized signal patterns and by using some simple statistical techniques, anomalies were detected. In the following subsections, we discuss about all of the steps in detail.

A. Extract Features from Signal
Here, the methods used in this paper to extract features from the signal gathered by the TMR sensor is explained. This main step consists of three steps. First, by designing a filter to remove poweline harmonics, the signal is filtered and it is used for time-domain analysis. The signal is segmented into the same-sized windows. Each signal segment is then windowed with a smoothing window function. Afterwards, to extract the frequency domain features, each signal segment is transformed into the frequency domain. Frequency components related to powerline harmonics are removed and the remaining frequency components are considered as features. Thus, we take advantage of both ways (time and frequency) to look at the signal.
1) Filtering the signal: To filter the powerline harmonics, a finite inpulse response (FIR) multi-band stop filter was designed, shown in Fig. 3. As it can be seen from this figure, only frequencies in some frequency bands related to the powerline harmonics are attenuated and the other frequencies are passed. The order of filter was considered N I = 400 and to approach an optimal filter design, we used the equiripple method. If x[n] is the input signal, h[n] is the impulse response of the designed filter, the filtered signal y[n] is computed by convolving x[n] with h[n] (Eq. 1). In this equation, b i denotes 2) Segmenting the signal: The length of signal to be transformed into the frequency domain is so large that we have to subdivide it into smaller signal segments. Otherwise, the desired frequency resolution can not be achieved. Depending on the sampling rate frequency of the signal (F s ) and the fact that to reconstruct a signal to its frequency domain, new length is the next power of 2 from the signal window, the length of each window was set (S w ). Otherwise, the signal will be padded with trailing zeros. Lets assume that the number of signal windows is N w . Also, to avoid the edges of each signal window from being lost, the signal windows should overlap with each other for which an overlap of O w was considered.
3) Transforming into frequency domain: To deconstruct a time domain representation of a signal into its frequency domain, Fast Fourier Transform (FFT) is used. Due to the significant limitation of FFT based on the fact that FFT interprets two endpoints of the signal connected to each other, we improved the signal clarity using windowing before applying FFT. Hence, we applied Hanning, a smoothing window function, since it is generally more satisfactory compared to other window functions. Smoothing window reduces the amplitudes of boundaries of the signal window toward zero, smoothly and gradually. If x[n] is a signal window, the windowing result y[n] is computed using Eq. 2 where w[n] is the Hanning window. Afterwards, y is deconstructed to frequency components using Eq. 3. Finally, removing the frequency bands of powerline harmonics from the magnitude spectrum leads to extraction of features. In the following, the windowed signal segments were denoted by {w 1 , w 2 , ..., w Nw } and the frequency components of magnitude spectrum before and after removing powerline harmonic frequencies were shown by

B. Extracting Events
Here, how to extract events from the frequency domain features and the time domain information is explained. A dimension reduction technique is applied on the frequency domain features to transform the high dimensional features to low dimensional components. The extracted components are then fed into a k-means method in order to be clustered. Using some techniques based on cross-correlation (CC) method, some of the event clusters are selected and then divided into two different groups in order to distinguish between the signals having similar frequency domain features but different time domain information.
1) Reducing dimension of features: To transform highdimensional features into a low-dimensional features so that the latter retains most of the meaningful information, we can take advantage of dimension reduction techniques. We used principle component analysis (PCA) [47] which is an important linear technique that performs mapping of data to a lower dimensions by maximizing the data variance in low dimensional representation using singular value decomposition (SVD). The number of principle components was determined in such a way that low dimensional representation contains a specific percent of total variability (V P C ), denoted by Clustering into event clusters: So far, each signal segment has been represented by its extracted principle components. Here, how all of the signal segments cluster into different event clusters using their P Cs is explained. In other words, we want to put the signal segments, whose behaviours are much alike, in a same group. Hence, the similarity between their corresponding P Cs should be measured with an appropriate measure. Algorithm 1 presents the algorithm proposed to cluster the signal segments. The proposed algorithm consists of three parts: 1) first level clustering; 2) find relevant event clusters; and 3) second level clustering.
In the first level clustering, all the signal segments cluster into different groups. Depending on the initial centroids or seeds, the clustering result will be different. Hence, we run the k-means method in several iteration (N itr ), compute withincluster sums of point-to-centroid distance for each iteration (Dist within ), find the iteration with minimum distance (I), and finally, run the 1-iteration k-means using the centroids computed in I th iteration. For this purpose, we used the wellknown k-means which is a centroid-based clustering method [49] for the first level clustering. Afterwards, the most frequent events is considered as background.
So far, the clustering has been done based on frequency domain information. What if in some signal segments, the frequency components of magnitude spectrum are similar while in the time domain are inverted? To overcome this challenge, it is first required to find the relevant clusters and then run the second level clustering algorithm on these clusters. To find relevant event clusters, the transition probability between event clusters is calculated. The relevant clusters are the ones that have not happened too much and the variance of its transition probabilities are not minimum or maximum.
In the second level clustering, each of the signal segments belonging to the relevant cluster are compared with its first signal segment using cross-correlation after normalization. Eq. 4 expresses the cross-correlation between two signals where S 1 denotes the conjugate of S 1 . If two signals are very Algorithm 1 Clustering the PCs into Event Clusters

Assumption
Number of Clustering Iterations = N itr Number of Event Clusters = k Input/Output Input: Principle Components (P Cs) of signal segments Input: Signal segments (S e ) Output: Event Clusters (C E ) Clustering Level 1 for every iteration i = 1 to N itr do Cluster P Cs using k-means Find within-cluster distances (Dist within (i)) end for I = argmin i (Dist within ) Run 1-iteration k-means using I th centroids Remove the most frequent events (background) similar, the most prominent maximum (P max ) is significantly larger than the most prominent minimum (P min ), whereas the opposite is true if the signals are inverted. As a result, if the most prominent extremum is maximum, the signal segment belongs to a same cluster. Otherwise, it belongs to a new created cluster.

C. Recognizing Signal Patterns
The algorithm designed for recognizing signal patterns consists of three steps, which are explained here. First, the consecutive events are subdivided into different segments depends on their time of occurrences. Afterwards, the segments that were repeated the most are found and clustered into appropriate groups using a density-based clustering algorithm applied for the cross-correlation results. By proposing a similarity measure, each segment is compared with the segments repeated a lot in order to find the appropriate group to which this segment belongs. After grouping all the segments, a dataset for each pattern cluster containing the corresponding segments' signal patterns are provided.
1) Segmenting the consecutive events: So far, we could identify the events that are presented by their corresponding event clusters (C E ) and their time of occurrences. Although in some part of the signal, no event has been identified because those part detected as background, on the contrary, some of these events occurred consecutively. Here, the consecutive events whose time of occurrence is near (less than t) are considered as a sequence. As a result, the events are transformed into the fewer number of sequences, called sequence segments (Seg).
2) Clustering the frequent sequence segments into pattern clusters: Of all the sequence segments, some have happened a lot. However, their corresponding signals might be very similar. Hence, we first find the frequent segments and then cluster them in order to put the similar frequent sequence segments into a same group. Algorithm 2 expresses the way to cluster these frequent sequence segments (third level clustering). In this algorithm, we first extract the features appropriate for third level clustering, which is based on the results of cross-correlation between corresponding signals (S f ) of each pair of frequent segments. By taking advantage of the same method used in section II-B2, a matrix (M cc ) with the size of (N f × N f ) is created. It is assumed that each row of M cc represents an observation with N f features which are then fed into a density-based spatial clustering of applications with noise (DBSCAN) [50]. Due to its advantage not having to specify the number of clusters, we chose DBSCAN method for the third level clustering. DBSCAN clusters the observations based on two parameters epsilon and minpts which indicate neighbourhood search radius and a minimum number of neighbours, respectively. If some of the frequent sequence segments represents noisy signals, then DBSCAN clusters them in separate groups. To group these sequence segments in one cluster, before applying DBSCAN, the maximum value of cross-correlation is computed (M ax cc ) and for each frequent segment, its average (M ean cc ) is computed. If it is less than a coefficient (β) multiplied by the total average (µ cc ), they are considered as the first pattern cluster.
3) Grouping all the sequence segments using a similarity measure: So far, we figured out how many pattern clusters exists in the dataset. Here, we want to allocate each sequence segment or its corresponding signal (signal pattern) to an appropriate pattern cluster. For simplicity, instead of sequence segments/signal patterns, we will say samples. Algorithm 3 expresses the way to measure the similarity of all samples to pattern clusters and finally, consider it as a member of the most similar pattern cluster set. The similarity measure is based on both the frequency and time domain information, i.e. sequence segments and their corresponding signal patterns. As it can be seen from the algorithm, distance (Dist) between every sequence segment and the frequent ones are computed, and the index with minimum distance (I1) determines its label if minimum distance is short and the cardinality of I1 equals to 1 (cardinality of sets were shown by |.|). Otherwise, we take advantage of time domain analysis using the same method mentioned in section II-B2 to measure similarity between signal patterns and to find the index with maximum similarity (I2). The intersection between I1 and I2 determines label of the sample if it is not empty. Otherwise, the maximum index of M ax cc determines the label. As a result, all samples are recognized and now, we are able to make dataset for each pattern cluster consisting of all the samples belonging to it. It should be noted that the function used to compute (Dist) between two sequence segments returns the minimum number of elements which must be omitted from each of them or both, so that the remaining elements of sequence segment are the same.

D. Detecting Anomalies
Up to now, we explained all the details about recognizing signal patterns. Here, the proposed method to find anomalies relying on the results of previous steps is described. We are supposed to detect two types of anomalies in this work: one is the samples deviated from the norm of the pattern cluster to which it belongs, and two is the sequence of patterns which happen rarely. For detecting the first type of anomaly, we compute the anomaly score and by using a threshold, anomalies are found. Whereas for the second type of anomaly, we compute the transition probability between each pair of Count elements of I belonging to C P s (N EI ) pattern clusters. In this way, the most frequent sequences of patterns and also the sequences which happens rarely can be found using the transition probability. Algorithm 4 expresses how to identify anomalies.
1) Detecting anomalous samples: Sometimes, the samples are categorized in a wrong cluster or differ significantly from the corresponding cluster norm. If the latter happens, it may be due to the poor condition of the machine or in other words, it may indicates a malfunction of the machine that can lead to the machine's breakdown. Hence, detecting anomalous samples is a very important task to prevent failures. To detect anomalous signal patterns or sample, a simple statistical method based on Z-Score is used. Eq. 5 expresses how to compute Z- Zero-padding the signals belonging to i th cluster µ i = S i (1) for every sample in i th cluster j = 1 to N si do Find lag between S i (j) and µ i Shift µ i by lag Score where µ, σ and x represent mean, standard deviation and a sample, respectively. The scoring method used here is similar but slightly different. The difference is that the score is calculated in such a way that is always positive.
In the method used here, the average of all samples belonging to each pattern cluster is computed. Afterwards, the deviation of each signal pattern from the average indicates the degree of anomaly. As it can be seen from Algorithm 4, to compute the average of all signal patterns, all the signal patterns must be changed to a same size using zero-padding which simply increases the signal's length by adding zeros to Fig. 4: Steps to extract information from the signal. A small part of original signal was shown here filtered using the designed FIR filter to be suitable for time domain analysis. On the other hand for frequency domain analysis, the original signal were segmented to the same-sized windows each of which was then windowed by a Hanning window. Afterwards, it was converted into the frequency domain using FFT. The magnitude spectrum of FFT represents the frequency components (FCs). The FCs relating to the power line harmonics were removed and the remaining FCs were considered as features.
the start and end of it. Then, by using cross-correlation, the delay (lag) between signals is computed. By shifting one of the signal as much as the computed lag and then summing all the same-sized shifted signals, the average of i th pattern cluster samples (µ i ) and the standard deviation of i th pattern cluster are computed. If the Score is less than a threshold (ν), the sample deviation from the mean is small and it will be considered as normal or non-anomalous sample. Otherwise, the deviation from mean is large and we can consider it as anomalous sample.
2) Detecting anomalous pattern sequences: It is worth mentioning that the disposition of signal patterns in relation to each other is important because each signal pattern indicates a specific state or mode of the machine. Hence, the order of signal patterns recognised in the data can be referred to basic information of the machine's behaviour. Here, we are going to distinguish common signal patterns from rare signal patterns, which could indicate anomalous sequences. For this purpose, state transition probability between every pair of pattern clusters occurring consecutively is computed. The state transition probability between i th and j th pattern clusters is computed by prob(s m+1 ∈ C i P |s m ∈ C j P ), that indicates the conditional probability of the (m + 1) th sample belonging to C i P given the (m) th sample belonging to C j P . By computing the probability of each pattern cluster occurring and multiplying with its corresponding state transition probability, transition probability matrix (T = {t ij } Nc i,j=1 ) is computed. The matrix T indicates the occurrence probability of different pairs of pattern clusters. In this algorithm, |C i P | is cardinality of the set of i th pattern cluster. If the transition probability between pairs is less than a threshold (θ), i.e. these transitions are rarely happened and they may be anomalous sequences. Anomalous sequences may be the order of more than two signal patterns. In this case, we can find the transition probability between more than 2 pattern clusters and find the ones less than threshold as rare pattern sequences.

III. RESULTS AND DISCUSSION
In this section, performance of the system designed to recognize patterns and detect anomalies is presented. The simulation results of each main step are shown separately and also the signal patterns which are detected as anomalies are shown. To fully investigate performance of the proposed system, we labelled the patterns manually and then, using some evaluation parameters, the actual results was compared with the desired ones. Finally, the advantages of the proposed approach and some future works are discussed.

A. Results of the algorithm's main steps
Here, the results of proposed algorithm's main steps are depicted. Fig. 4 shows the results of every single step to extract frequency features for frequency domain analysis as well as filtered signal for time domain analysis. As it is seen, for frequency domain analysis, the original signal was first segmented and windowed to the N w same-sized windows which the amplitude of their boundaries were gradually decreased. Each window was transformed into frequency components (F Cs) using FFT and the F Cs related to power line harmonics were then removed. The remaining components indicates the frequency domain features. As a result, we have N w window segments each of which represented by N features. Fig. 5 shows the results of first and second level clustering of the principle components (P Cs) to extract events. As it is seen, the features of each window segment was transformed into the lower dimensions. Each window segment was therefore represented by its corresponding n dimensional principle components. The window segments were then clustered using the k-means method that its result was shown in the scatter plots. The window segments belonging to a specific cluster were shown by an individual sign and colour. In the first level clustering, the window segments were first clustered into different groups. The optimal value of k is automatically determined by an internal evaluation of the clustering for a range of different values of k. Here, k = 5 was determined by the algorithm with one cluster representing the background, which leaves 4 event clusters. However, during the conversion into the frequency domain, some time domain specific information Fig. 5: Displaying the results of how to extract different events from the features (F C). First, PCA reduced the dimension of each window's features and transformed it to a smaller number of components (P C). The P Cs extracted from all windows was then clustered into k event clusters (Clustering Level 1). Then, using a proposed method based on cross-correlation, some of the clusters were chosen and each of which was then split into 2 different clusters (Clustering Level 2), leading to more than k event clusters. Fig. 6: Depiction of the steps to recognize signal patterns. The consecutive events were segmented based on their time of occurrences. Then, the k-most frequent segments (k = 16 in this case) were selected. By running cross-correlation between each pair of frequent segments' corresponding signals and then by taking advantage of DBSCAN method, the similar items were grouped automatically into clusters. Fig. 7: Depiction of the pattern clusters' seeds or averages and displaying some non-anomalous, soft-anomalous and anomalous samples. The average of all signal patterns belonging to every pattern cluster were computed, shown in first column. Then, by computing anomaly score based on Z-Score, a non-anomalous, soft-anomalous and anomalous sample belonging to each pattern cluster were shown in second, third and forth columns, respectively. Fig. 8: Anomaly distribution of every pattern cluster. The histogram of anomaly score considering all of the signal patterns belonging to each pattern cluster was displayed here. Also, a density function fitted to each histogram was drawn to show the anomaly score distribution. got lost. For example, this dataset shows a characteristic peak that is either positive or negative in the time domain, but indistinguishable in the frequency domain. Thus, using the proposed cross-correlation-based method in second level clustering, these different behaviours become distinguishable. It is worth mentioning that this method is not specific for this dataset and it could be applied to a general dataset. Because if all of the samples belonging to a cluster are similar in time domain, all of them will be remained in the same cluster.
In this dataset, the relevant clusters were this occurred are Cluster 3 and Cluster 4 and they were split into two clusters using the cross-correlation based method resulted in the total 6 different clusters (C E ), as shown in Fig. 5. As a result, except the window segments related to the background, others were considered as different types of events.
Considering the time of events occurrences, they were segmented into sequence segments which were then used to find the frequent ones. Fig. 6 shows the corresponding signals of N f = 16 most frequent sequence segments which were then clustered using DBSCAN method. As it can be seen, they have been clustered into 5 different pattern clusters (C P ). By measuring similarity between each sequence segment and the frequent ones, it was grouped into one of the 5 pattern clusters, which led to the recognition of all signal patterns.
To find anomaly, the average of all signal patterns existing in each pattern cluster was calculated and the deviation of each signal pattern from the mean indicates degree of anomaly. The mean or seed (µ) of second to fifth pattern clusters have been shown in the first column of Fig. 7. For each pattern cluster, a non-anomalous or normal, a soft-anomalous and an anomalous sample have been shown. Higher score indicates higher probability of an anomalous signal pattern. From all the samples for non-anomalous signal patterns, it is obvious that the anomaly score less than a value leads to non-anomalous samples. Whereas, the signal patterns with anomaly score more than this value can be considered as soft-anomalous and anomalous samples depending on the anomaly score distribution of each pattern cluster. The anomaly distribution of each pattern cluster has shown in Fig. 8. As it can be seen, the total standard deviation of first, third and fifth pattern clusters are much smaller than second and forth ones, leading to narrower distributions. Moreover, it is seen that the anomaly score range in the second and fourth pattern clusters are less than the rest. According to these distributions and the score range for each pattern clusters, we can set the parameters to find soft-anomalies and anomalies. If density curve and its most prominence local maximum are denoted by D and D max , respectively, the thresholds to find soft-anomalies and anomalies can be computed using Eq. 6 and Eq. 7.
It is also clearly seen in Fig. 8 that the density curve of cluster 2 and 4 have two peaks unlike the other clusters which have one peak. This happened due to the two different width of signal patterns belonging to each of these clusters. It should be noted that because of this, the mean of these pattern clusters (Fig. 7) are a little different from their corresponding nonanomalous samples. In this case, two set of thresholds should be set, each of which should be applied for the corresponding samples. For this purpose, two representative samples, each of them represents a peak, are compared with all the samples belonging to the cluster and the samples are grouped to a more similar one (G 1 or G 2 ) using a cross-correlation based method. If two most prominence local maxima and the most prominence local minimum are denoted by D max 1 , D max 2 and D min , respectively, soft-anomalies and anomalies samples can be computed using Eq. 8 and Eq. 9. Using the parameters specified in Table II, the threshold values were computed and mentioned in Table I and based on these values, the softanomalous and anomalous samples were found.
Also, very rare sequences of signal patterns are considered as sequence anomalies. As mentioned earlier, they were found using the probability transition matrix between different pattern clusters. The sequence anomalies were extracted and their corresponding signals can be shown as anomalous sequences. Some of the anomalous sequences have been shown in Fig.  9.b. As seen in this figure, each anomalous sequence may consists of different number of signal patterns. The sequence of corresponding pattern clusters and their anomaly scores have been displayed above each sequence. To see the difference between these rare sequences and the normal sequences, two most repetitive sequences of signal patterns have been shown in Fig. 9.a.

B. Evaluation Process
All of the parameters necessary for the proposed algorithm have been specified for this real data and shown in Table  II. Also, the sampling rate and the duration of dataset has been shown. Since the ultimate aim of this work is to figure out the machine's behaviour quite unsupervisedly, a small part of dataset (about 10 %), that is unlabelled, was used in the training stage in order to extract effective and helpful information, which were then used in the test stage. The information extracted from training stage and used in test stage is the coefficients of PCA, the centroids of first level clustering, the clustering results of third level and the mean of pattern clusters. Using this information, we could diagnose anomaly.
To evaluate the scheme designed for signal pattern recognition, we used different evaluation parameters including Accuracy (Acc), Precision (P rec), Sensitivity (Sen), Specificity (Spec), F1Score (F 1S) and Purity (P u) which are expressed by the equations in Eq.10. In these equations, the number of true positive, false positive, true negative and false negative are denoted by T P , F P , T N and F N . Also, N d is the total number of signal patterns, N c is the number of clusters, C i P is i th pattern cluster and T arget j is the desired target of j th pattern.
The performance of the proposed clustering model using the above evaluation parameters have been shown in Table III. All of these parameters were computed for training and test sets.
Moreover, confusion matrix was computed for training and test sets individually and the results have been shown in Table IV. The target clusters was compared with the predicted output clusters. The diagonal cells indicate the number of total observations which were correctly recognized. Whereas, the off-diagonal cells indicate the number of observations which were incorrectly clustered. In these both confusion matrices, the far right columns show the precisions of every clusters, while the lowermost rows show the sensitivity of every clusters. The overall accuracy was shown in bottom right cell.
Furthermore, using the proposed method finding the softanomalies and anomalies of signal patterns, the number of samples in every cluster detected as soft-anomalies and anomalies were brought in Table V. Using the total number of samples belonging to each pattern cluster, the percentage of soft-anomalies and anomalies were computed and shown. It can be seen from this table that 3.88 % and 5.41 % of all the signal patterns were detected as anomalies and soft-anomalies, respectively.

C. Discussion and Future Works
Applying the proposed algorithm on the real data gathered by a sensor, led us to conducting the proof of concept validation. Also, the high accuracy, precision, sensitivity and other evaluation parameters demonstrate high efficiency of the proposed algorithm in pattern recognition. Considering both time and frequency domain information in different steps of the proposed algorithm enabled us to achieve these promising results. The time domain information was used to distinguish upward and downward shape of the patterns, which significantly increased accuracy and evaluation of other parameters. The anomalies could be detected efficiently because of its dependence to the results of pattern recognition. Also, it should be noted that the proposed algorithm requires reasonable resources, which can be easily run in real-time.
An important advantage of the proposed algorithm relates to its learning which is totally unsupervised and will be useful in cases where no prior knowledge is available. This property made the algorithm suitable for analysing other signal data and subsequently recognizing signal patterns. One of the future work we are going to do is gathering more data from the aforementioned sensor to figure out how well the proposed algorithm works for other signal data. It is worth mentioning that the algorithm was designed in such a way to have the fewest parameters and least need to initialize them. However, to increase its generalization, we are going to use some optimization algorithm to set the parameters automatically.
Other advantage of the proposed algorithm relates to the size of the training set compared to the test set. We used almost 2 hours of data for training stage and the rest (about 20 hours) for the test stage. Actually, we could figure out the machine's states and their transitions just by using a small amount of data. Using this information extracted from the training stage, we are able to analyse all the data that will be collected for the same machine. Moreover, as it can be seen from the accuracy of the results, the training of the algorithm on the training set was representative for the test set. In addition the unsupervised learning, enables the algorithm to train on easily available data without prior knowledge of the system and achieve high reliability.
In the future, we plan to associate the detected anomalies with the specific behaviours of this machine and enable machine operators to teach an additional algorithm layer, which anomalies are relevant for machine maintenance and failure.

IV. CONCLUSION
In this paper, a data-driven intelligent system has been developed for the non-invasive machine early anomaly detection to avoid machines' breakdown. The approach proposed for this system is totally unsupervised, depending only on the input data. The algorithm consists of different levels so that the input of each level of data analysis is the results of previous level. The cascaded structure of the proposed algorithm which was designed by taking advantage of hybrid time-frequency information, created a strong and robust clustering method that led to a high accuracy in pattern recognition. The recognized patterns were then used for anomaly detection relying on the deviation from the norms of patterns and their sequences. The proof of concept was evaluated using a real data gathered by the spintropic TMR-based sensor and the results shows high performance of the proposed algorithm. One advantage of this work relates to the extraction of useful information from a very small amount of data, which allows the algorithm to be performed on a huge amount of unseen data successfully. As a result, this paper provided an interesting approach for recognizing signal patterns and thereupon identifying anomalies.
Ensieh Iranmehr is a Research Engineer of the Technology Engineering Group at International Iberian Nanotechnology Laboratory (INL). Her focus is on using machine learning techniques to overcome the challenges of real problems. She received her PhD in Digital Systems from the electrical engineering department of Sharif University of Technology in 2020. Her works revolve around artificial neural network, machine learning, neuromorphic engineering and digital systems. For her PhD project, she has proposed a new neuromorphic structure of spiking neural network inspired by biological studies called ILS-based Reservoir Network. She also has MSc in Digital Electronic Engineering from the electrical engineering department of Amirkabir University of Technology. In her MSc project, she designed a system for tracking vehicle using image processing, and artificial neural networks. Ensieh involved in several academic projects considering artificial intelligence techniques in different applications and implement some of these algorithms on FPGA and GPU for accelerating the process and made them suitable for real-time tasks. She currently has several peer-reviewed publications.
Ricardo Ferreira received the Ph.D. in physics engineering from Instituto Superior Técnico (IST) in 2008 upon his work on ion beam deposited magnetic tunnel junctions targeting hard disk drive read heads, nonvolatile memories, and magnetic field sensor applications. He is a Researcher with the International Iberian Nanotechnology Laboratory (INL), where he is the Principal Investigator of the Spintronics Research group and the Coordinator of the ICT research area. INESC Microsistemas e Nanotecnologias was the host institution during his Ph.D. and also the lab where he conducted his research as a Post-Doctoral Fellow until joining INL. At INL, the work has been focused on the production of MTJs using MgO barriers and the development of two high-yield and highuniformity fabrication processes on 200 mm wafers. The current research goals include production of pT magnetic field sensors, the production of devices that explore the spin transfer effect (STT nano-oscillators, MRAM cells), the incorporation of perpendicular magnetization materials on MTJs, and the integration of MTJs with MEMS and CMOS devices. He is the coauthor of 80+ peer-reviewed publications.
Tim Böhnert finished his study of physics at the University of Hamburg, he was hired as a scientific assistant by the cluster of excellence "Nanospintronics". He received his PhD degree from the University of Hamburg (Germany). He joined INL within the Spintronics Research Group to work for SpinCal project (EXL04). In 2018, he took a position as Staff Researcher at INL with focus on MTJ based magnetic field sensors. He succeeded at magnetic imaging of microstructures using an array of 65000 integrated MTJs on a 4x4mm CMOS chip and fabricated an embedded magnetic field sensor demonstrator, which was successfully applied in multiple industry production facilities for smart sensing and predicative maintenance. He contributes in FET Open SpinAge project and he develops weighted spin torque nano-oscillators for neuromorphic computing systems. He currently has 38 peer-reviewed articles.