Binary Spectrum Feature for Improved Classifier Performance

Abstract—Classification has become a vital task in modern machine learning and Artificial Intelligence applications, including smart sensing. Numerous machine learning techniques are available to perform classification. Similarly, numerous practices, such as feature selection (i.e., selection of a subset of descriptor variables that optimally describe the output), are available to improve classifier performance. In this paper, we consider the case of a given supervised learning classification task that has to be performed making use of continuous-valued features. It is assumed that an optimal subset of features has already been selected. Therefore, no further feature reduction, or feature addition, is to be carried out. Then, we attempt to improve the classification performance by passing the given feature set through a transformation that produces a new feature set which we have named the "Binary Spectrum". Via a case study example done on some Pulsed Eddy Current sensor data captured from an infrastructure monitoring task, we demonstrate how the classification accuracy of a Support Vector Machine (SVM) classifier increases through the use of this Binary Spectrum feature, indicating the feature transformation's potential for broader usage.


I. INTRODUCTION
Supervised learning for classification builds models of the distribution of given class labels in terms of given predictor variables (or features). The learned models (known as classifiers) can then serve to assign class labels to testing instances where the predictor variable values (or features) are known, but the class labels are unknown [1]. Performing classification in this manner has become a common and vital component of modern Artificial Intelligence. Many techniques, such as Decision Trees, Discriminant Analysis, Perceptron-based techniques (e.g., Neural Networks), Logistic Regression, Bayesian Networks, instance-based learning (e.g., Nearest Neighbour classifiers), Ensemble Classifiers, and Support Vector Machines (SVM), have been developed to learn and perform classification [1], [2], [3], [4]. More recently, deep learning based classification techniques have also been developed [5].
Over-fitting, lack of accuracy, and computation cost are some of the commonly encountered challenges when developing classifiers. Over-fitting and computation cost become issues especially when working with high dimensional data. Feature selection (or feature reduction) is a commonly followed practice to overcome the curse of dimensionality, and on occasion it helps alleviate over-fitting and accuracy-related issues as well. Following the literature, the methods available for feature reduction can be categorized as three-fold: (1) filter methods; (2) wrapper methods; and (3) embedded methods [6], [7]. In this paper, we consider the case where an optimal feature selection has already been carried out. That is, we focus on a supervised learning classification task that has to be performed with a given set of features, with no further feature reduction or feature addition being allowed. We assume the features to be real and continuous-valued. Now suppose there is some benchmark accuracy that can be achieved by using the feature set as it is. We then ask whether that benchmark accuracy can be surpassed by performing some transformation on the existing feature set. We contribute in this paper by answering that question, by introducing a feature transformation we name the "Binary Spectrum feature transformation".
Derivation of the Binary Spectrum feature is presented in detail in this paper. Following the derivation, we demonstrate the effectiveness of the Binary Spectrum transformation by benchmarking the accuracy of an SVM-based classification task and then surpassing that benchmark. Classification is performed on Pulsed Eddy Current (PEC) sensor data. This dataset was collected from an automated infrastructure monitoring exercise performed on a ferromagnetic critical water pipe [8], [9]. The class labels reflect the thickness of the pipe wall at the points where sensing was done. The descriptor variables are real continuous-valued features extracted from the corresponding PEC signals [8], [9], [10], [11].
The Binary Spectrum feature transformation transforms a given set of real continuous-valued features into a set containing both continuous-valued and discrete-valued categorical-like (i.e., binary) data. This categorical-like data subset is derived from the original continuous-valued dataset. The rationale behind the evident improvement in classification accuracy resulting from this transformation can be argued to be the effect of combining two data types: (1) the natural continuous-valued features; and (2) a categorical-like component derived from the natural features.
The structure of the paper is as follows: Section II mathematically formulates the problem of improving the accuracy of a given classifier, or classification task; Section III presents the derivation of the Binary Spectrum feature transformation and an algorithm to find best performing classifiers; Section IV presents the effectiveness of the proposed method via a demonstrative example performed on Pulsed Eddy Current sensor data; and Section V presents conclusions.

II. PROBLEM FORMULATION
We consider a binary classification (i.e., two-class classification) supervised learning problem that has to be solved making use of continuous-valued features. As such, let there be a given binary classifier, trained by the training data $X_t \in \mathbb{R}^{a \times b}$ and $Y_t \in \mathbb{B}^{a}$, where $\mathbb{B}^{a}$ denotes an $a \times 1$ vector of binary digits. We make the following assumptions about the training data.
Assumption 1: The training feature set (i.e., $X_t$) is an optimal subset of training features, i.e., no further feature reduction or feature addition is to be done.
Assumption 2: The two classes in the set of training labels (i.e., $Y_t$) are evenly (or equally) populated, i.e., there is approximately a 50:50 population ratio between the two classes.
The vector $Y_t \in \mathbb{B}^{a}$ containing the training labels (or training targets) is given by
$$Y_t = \begin{bmatrix} y_{t1} & y_{t1} & \dots & y_{t1} & y_{t2} & y_{t2} & \dots & y_{t2} \end{bmatrix}^{T}_{1 \times a} \quad (1)$$
where $y_{t1} = 0$, $y_{t2} = 1$, and $[\,\ast\,]^{T}$ denotes the matrix transpose. The corresponding training features contained in $X_t \in \mathbb{R}^{a \times b}$ are given by
$$X_t = \begin{bmatrix} x_{t11} & x_{t12} & \dots & x_{t1b} \\ \vdots & \vdots & \ddots & \vdots \\ x_{ta1} & x_{ta2} & \dots & x_{tab} \end{bmatrix} \quad (2)$$
where $i, j \in \mathbb{Z}^{+}$ are generic subscripts with $1 \leq i \leq a$ and $1 \leq j \leq b$.
Similarly, the testing dataset on which the classifier is to make predictions is given by the corresponding matrices $X_{te} \in \mathbb{R}^{c \times b}$ and $Y_{te} \in \mathbb{B}^{c}$, with
$$Y_{te} = \begin{bmatrix} y_{te1} & y_{te1} & \dots & y_{te1} & y_{te2} & y_{te2} & \dots & y_{te2} \end{bmatrix}^{T}_{1 \times c} \quad (3)$$
where $y_{te1} = 0$ and $y_{te2} = 1$.
With this data, we define the operations $o(X_t) = X_t u$ and $o(X_{te}) = X_{te} u$ with respect to $u \in \mathbb{R}^{b \times b}$, where $u$ is an orthonormal basis of $X_t$. Now suppose a classifier trained with the above defined data (i.e., $o(X_t)$, $Y_t$) is given, and denote it as $C$, $C: \mathbb{R}^{d \times b} \to \mathbb{B}^{d}$. This classifier predicts the classes (i.e., 0 or 1) for the testing data $X_{te}$. The prediction output comes in a vector $\hat{Y}_{te} \in \mathbb{B}^{c}$ given as
$$\hat{Y}_{te} = C(o(X_{te})) \quad (4)$$
$$\hat{Y}_{te} = \begin{bmatrix} \hat{y}_{te1} & \hat{y}_{te2} & \dots & \hat{y}_{tec} \end{bmatrix}^{T}_{1 \times c} \quad (5)$$
The vector $err$ is then defined in order to compute the classification accuracy:
$$err = \left| \, Y_{te} - \hat{Y}_{te} \, \right| \quad (6)$$
Locations of $err$ corresponding to instances where a correct prediction has been made carry zeros. Therefore, we define the total number of zeros in $err$ as $sum_0$. With that, we define the classification accuracy $acc$ as
$$acc = \frac{sum_0}{c} \times 100\% \quad (7)$$
As such, $acc$ can be represented as a function in the following manner:
$$acc = g\left( C, o(X_{te}), Y_{te} \right) \quad (8)$$
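As a concrete illustration, the accuracy computation described above can be sketched in a few lines of Python; the function name `classification_accuracy` and the example label vectors are ours, not from the paper.

```python
# Sketch of the err vector and accuracy acc, with labels and
# predictions held as plain Python lists of binary digits (0/1).
def classification_accuracy(y_te, y_te_hat):
    # err carries a zero wherever the prediction matched the label
    err = [abs(y - y_hat) for y, y_hat in zip(y_te, y_te_hat)]
    sum_0 = err.count(0)       # total number of zeros in err
    c = len(y_te)              # number of testing instances
    return sum_0 / c * 100.0   # acc, as a percentage

# Example: 4 of the 5 predictions are correct, so acc = 80.0
acc = classification_accuracy([0, 0, 1, 1, 1], [0, 1, 1, 1, 1])
```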
The objective now is to increase the classification accuracy (i.e., to increase $acc$) using the same sets of features $X_t$ and $X_{te}$, without any reduction or addition of features, respecting Assumption 1. Accomplishing this objective under Assumption 1 is the reason this paper introduces the Binary Spectrum feature transformation.

III. DERIVING THE BINARY SPECTRUM FEATURE
The Binary Spectrum transformation is performed by applying a function $f$, $f: \mathbb{R}^{e \times b} \to \mathbb{B}^{e \times bn}$, $n \in \mathbb{Z}^{+}$, to the features $X_t$ and $X_{te}$ in the following manner:
$$X_{tb} = \begin{bmatrix} X_t & f(X_t, n) \end{bmatrix} \quad (9)$$
$$X_{tb} = \begin{bmatrix} x_{t11} & \dots & x_{t1b} & f(x_{t11}, n) & \dots & f(x_{t1b}, n) \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ x_{ta1} & \dots & x_{tab} & f(x_{ta1}, n) & \dots & f(x_{tab}, n) \end{bmatrix} \quad (10)$$
$$X_{teb} = \begin{bmatrix} X_{te} & f(X_{te}, n) \end{bmatrix} \quad (11)$$
$$X_{teb} = \begin{bmatrix} x_{te11} & \dots & x_{te1b} & f(x_{te11}, n) & \dots & f(x_{te1b}, n) \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ x_{tec1} & \dots & x_{tecb} & f(x_{tec1}, n) & \dots & f(x_{tecb}, n) \end{bmatrix} \quad (12)$$
Here, the matrix sizes come as $a \times b(n+1)$ for $X_{tb}$ and $c \times b(n+1)$ for $X_{teb}$. Also, $f(x_{tij}, n)$ and $f(x_{teij}, n)$, for any $x_{tij} \in \mathbb{R}$ or $x_{teij} \in \mathbb{R}$, is defined as
$$f(x_{ij}, n) = \begin{bmatrix} \beta_{n-1} & \beta_{n-2} & \dots & \beta_{0} \end{bmatrix}, \quad \beta_k \in \mathbb{B} \quad (13)$$
where $f$ scales $x_{ij}$ and rounds it off to the nearest integer (discussed in the remainder of this section), producing the Binary Spectrum vector in (13), which is the binary value of the scaled and rounded $x_{ij}$ given to $n$ bits.
To perform the transformation done by $f$ for a prescribed number of bits $n$, we first scale the $x_{ij}$ values to remain within the two bounds $l_{low}$ and $l_{up}$, defined as
$$l_{low} = 0 \quad (14)$$
$$l_{up} = 2^{n} - 1 \quad (15)$$
Now consider the example where the training data point $x_{tij}$ is to be scaled. To scale $x_{tij}$, governed by the column subscript of $x_{tij}$, we select the $j$-th column of the corresponding training feature matrix $X_t$. The $j$-th column, symbolized as $X_{tj|} \in \mathbb{R}^{a}$, comes as
$$X_{tj|} = \begin{bmatrix} x_{t1j} & x_{t2j} & \dots & x_{taj} \end{bmatrix}^{T} \quad (16)$$
The minimum and maximum values contained within $X_{tj|}$ are denoted as $\min(X_{tj|})$ and $\max(X_{tj|})$ respectively. With those, we define the scaling of any training feature value $x_{tij}$ as
$$x_{tijs} = \frac{x_{tij} - \min(X_{tj|})}{\max(X_{tj|}) - \min(X_{tj|})} \times (l_{up} - l_{low}) + l_{low} \quad (17)$$
where $x_{tijs}$ is the scaled value of $x_{tij}$. Now consider the case of scaling the testing feature values in $X_{te}$. To scale any testing feature value $x_{teij}$, we use the same $\min(X_{tj|})$ and $\max(X_{tj|})$ coming from $X_{tj|}$. This selection is governed by the column subscript of $x_{teij}$.
With those, we define the scaling of any testing feature value $x_{teij}$ as
$$x_{teijs} = \frac{x_{teij} - \min(X_{tj|})}{\max(X_{tj|}) - \min(X_{tj|})} \times (l_{up} - l_{low}) + l_{low} \quad (18)$$
where $x_{teijs}$ is the scaled value of $x_{teij}$. When scaling testing feature values using (18), there is a chance of some scaled values lying outside the bounds specified by $l_{low}$ and $l_{up}$. That is a limitation of this scaling method, and to alleviate some of the adversity caused by outliers, for all $i, j$ where $x_{teijs} < l_{low}$ we assign
$$x_{teijs} \leftarrow l_{low} \quad (19)$$
and for all $i, j$ where $x_{teijs} > l_{up}$ we assign
$$x_{teijs} \leftarrow l_{up} \quad (20)$$
Following the assignments of (19) and (20), all training and testing feature values in $X_t$ and $X_{te}$ will have been scaled to map within the lower and upper bounds prescribed by $l_{low}$ and $l_{up}$ in (14) and (15).
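A minimal Python sketch of this column-wise scaling with the clipping of (19) and (20), assuming $l_{low} = 0$ and $l_{up} = 2^{n} - 1$ as in (14) and (15); the helper name `scale_value` and the toy column are ours, not from the paper.

```python
# Scale one feature value against the j-th *training* column, then
# clip to [l_low, l_up] so out-of-range testing values are handled.
def scale_value(train_col, value, n):
    l_low, l_up = 0.0, float(2 ** n - 1)      # bounds (14)-(15)
    mn, mx = min(train_col), max(train_col)   # min/max of X_t's column
    s = (value - mn) / (mx - mn) * (l_up - l_low) + l_low  # (18)
    return min(max(s, l_low), l_up)           # clipping (19)-(20)

train_col = [2.0, 4.0, 6.0]
s_mid = scale_value(train_col, 4.0, 3)  # mid-range maps to 3.5 on [0, 7]
s_out = scale_value(train_col, 9.0, 3)  # beyond the training max: clipped to 7.0
```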
Rounding the scaled training feature values $x_{tijs}$ and testing feature values $x_{teijs}$ to their nearest integers, converting the rounded numbers to binary, and representing the binary values in $n$ bits is how the Binary Spectrum vectors of (13) are formed. Substituting the Binary Spectrum vectors constructed in that manner into (10), (12) and (9), (11) yields the Binary Spectrum matrices $X_{tb}$ and $X_{teb}$.
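Putting the steps together, the full transformation can be sketched in pure Python, producing a matrix of the form $[X_t\ f(X_t, n)]$ with $b(n+1)$ columns; the helper name and the toy matrix are ours, and the scaling assumes $l_{low} = 0$, $l_{up} = 2^{n} - 1$.

```python
# Build the Binary Spectrum matrix: per column, scale to [0, 2^n - 1]
# using the training column's min/max, clip, round to the nearest
# integer, expand into an n-bit binary vector, and append the bits
# after the original continuous-valued features.
def binary_spectrum(X, X_train, n):
    up = 2 ** n - 1
    cols = len(X[0])
    out = []
    for row in X:
        bits_row = []
        for j in range(cols):
            col = [r[j] for r in X_train]
            mn, mx = min(col), max(col)
            s = (row[j] - mn) / (mx - mn) * up   # scale
            s = min(max(s, 0.0), up)             # clip
            v = round(s)                         # round to nearest integer
            bits_row += [(v >> k) & 1 for k in range(n - 1, -1, -1)]
        out.append(row + bits_row)               # row of length b(n + 1)
    return out

X_t = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
X_tb = binary_spectrum(X_t, X_t, 2)   # 3 x 6 matrix, since b(n+1) = 2 * 3
```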
With this data, we define the operations $o(X_{tb}) = X_{tb} v$ and $o(X_{teb}) = X_{teb} v$ with respect to $v \in \mathbb{R}^{b(n+1) \times b(n+1)}$, where $v$ is an orthonormal basis of $X_{tb}$.
A new classifier $C_n$ of the form $C_n: \mathbb{R}^{d \times b(n+1)} \to \mathbb{B}^{d}$ can now be trained with the training dataset $o(X_{tb})$ and $Y_t$. This classifier predicts the classes for the testing data $X_{teb}$. The prediction output comes in a vector $\hat{Y}_{teb} \in \mathbb{B}^{c}$ given as
$$\hat{Y}_{teb} = C_n(o(X_{teb})) \quad (21)$$
The vector $err_b$ can then be defined in order to compute the classification accuracy:
$$err_b = \left| \, Y_{te} - \hat{Y}_{teb} \, \right| \quad (22)$$
Similar to the vector $err$ in (6), locations of $err_b$ corresponding to instances where a correct prediction has been made carry zeros. Therefore, we define the total number of zeros in $err_b$ as $sumb_0$. With that, we define the classification accuracy $accb$ as
$$accb = \frac{sumb_0}{c} \times 100\% \quad (23)$$
As such, $accb$ can be represented as a function, similar to the representation of $acc$ shown in (8):
$$accb = g\left( C_n, o(X_{teb}), Y_{te} \right) \quad (24)$$
Now suppose a classifier $C_n$ for some $n \in \mathbb{Z}^{+}$ can be found such that the condition $accb > acc$ is satisfied. Then, our objective of increasing the classifier accuracy will be accomplished without removing any features from, or adding new features to, the feature sets $X_t$ and $X_{te}$, i.e., while satisfying Assumption 1. As opposed to reducing or expanding the feature sets $X_t$ and $X_{te}$, what enables superior classification accuracy in the proposed method is the Binary Spectrum transformation of the existing features.
Recall the function representations of $acc$ and $accb$ given in (8) and (24) respectively. With those, the ultimate solution one can seek following this method can be expressed as
$$n^{*} = \underset{n \in \mathbb{Z}^{+}}{\arg\max} \; g\left( C_n, o(X_{teb}), Y_{te} \right) \quad (25)$$
subject to the constraints $accb > acc$ and $n < n_{max}$, where $n_{max} \in \mathbb{Z}^{+}$ is some meaningful maximum number of bits to be allowed. The selection of an ideal value for $n_{max}$ is an open question for the time being, and users of this method have the freedom to experiment. An initial constraint one might hypothesize for $n_{max}$ may be keeping the Binary Spectrum training feature matrix $X_{tb}$ in (9) a tall and skinny matrix (i.e., $a > b(n+1)$), given that the training dataset in its original form (i.e., $X_t$ in (2)) has $a$ instances and $b$ features.
Finding an optimal solution $n^{*}$ by solving (25) results in an optimal Binary Spectrum transformation $f(X_t, n^{*})$ and a classifier $C_{n^{*}}$, trained from $o(X_{tb})$ and $Y_t$, that performs better than the classifier $C$ trained from $o(X_t)$ and $Y_t$. Thus, the objective of increasing classification accuracy will be accomplished via the Binary Spectrum transformation. As a preliminary effort, we propose Algorithm 1 to find $n^{*}$ and the corresponding $C_{n^{*}}$ iteratively.
Algorithm 1: Find n * and C n * iteratively.
Result: $n^{*}$, $C_{n^{*}}$
$n^{*} \leftarrow 0$;
$C_{n^{*}} \leftarrow C$, where $C$ is trained with $o(X_t)$, $Y_t$;
$\hat{Y}_{te} \leftarrow C(o(X_{te}))$;
$acc \leftarrow$ accuracy of $C$, calculated from (7);
$n \leftarrow 1$;
$n_{max} \leftarrow n_{max}$ $(\in \mathbb{Z}^{+})$, $n_{max} > 1$;
while $n \leq n_{max}$ do
    $X_{tb} \leftarrow [\,X_t \;\; f(X_t, n)\,]$;
    train $C_n$ with $o(X_{tb})$, $Y_t$;
    $X_{teb} \leftarrow [\,X_{te} \;\; f(X_{te}, n)\,]$;
    $\hat{Y}_{teb} \leftarrow C_n(o(X_{teb}))$;
    calculate $accb$ of $C_n$ from (23);
    if $accb > acc$ then
        $acc \leftarrow accb$;
        $n^{*} \leftarrow n$;
        $C_{n^{*}} \leftarrow C_n$;
    end
    $n \leftarrow n + 1$;
end
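Algorithm 1 can be transcribed directly into Python. Here `train`, `accuracy`, and `transform` are caller-supplied stand-ins for the classifier training, the accuracy computations of (7)/(23), and the Binary Spectrum transformation, so the sketch below (including the toy stand-ins in the usage example) is illustrative rather than the paper's implementation.

```python
# Iterative search for n* and C_n* per Algorithm 1: benchmark the
# baseline classifier, then try each bit-width n = 1..n_max and keep
# the classifier whose accuracy beats the best seen so far.
def find_best_n(train, accuracy, transform, X_t, Y_t, X_te, Y_te, n_max):
    C = train(X_t, Y_t)                    # baseline classifier C
    best_acc = accuracy(C, X_te, Y_te)     # benchmark acc
    n_star, C_star = 0, C
    for n in range(1, n_max + 1):
        X_tb = transform(X_t, X_t, n)      # [X_t  f(X_t, n)]
        C_n = train(X_tb, Y_t)
        X_teb = transform(X_te, X_t, n)    # [X_te  f(X_te, n)]
        accb = accuracy(C_n, X_teb, Y_te)  # accb for this n
        if accb > best_acc:                # found a better classifier
            best_acc, n_star, C_star = accb, n, C_n
    return n_star, C_star, best_acc

# Toy stand-ins: the "classifier" is just its feature count, and the
# "accuracy" rewards wider feature sets up to a plateau at 5.
toy_train = lambda X, Y: len(X[0])
toy_acc = lambda C, X, Y: 90.0 + min(C, 5)
toy_tf = lambda X, X_tr, n: [row + [0] * (len(row) * n) for row in X]
n_star, C_star, best = find_best_n(toy_train, toy_acc, toy_tf,
                                   [[1.0, 2.0]], [0], [[1.5, 2.5]], [0], 3)
```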

IV. DEMONSTRATIVE EXAMPLE, EXPERIMENTS & RESULTS
In this section we demonstrate how the proposed Binary Spectrum feature (or transformation) improves classification performance. We consider a Support Vector Machine (SVM) binary (i.e., two-class) classifier working with two features (or descriptor variables), i.e., $X \in \mathbb{R}^{h \times 2}$, $Y \in \mathbb{B}^{h}$.

A. The Dataset
The dataset used for this work consists of 8,400 Pulsed Eddy Current (PEC) signal measurements captured on different wall thickness values of grey cast iron. The dataset was collected through the works [8], [9], [10], [12], [13], [14]. The class labels (in $Y$) are decided based on the wall thickness (measured in mm). To respect Assumption 2 (i.e., to have approximately a 50:50 population split between the two classes in the training dataset), the cut-off thickness value was chosen to be 23.3 mm after examining the data. Thickness values less than or equal to 23.3 mm are considered Class 1, having class label '0'. Class 1 has 4,169 instances, accounting for 49.63% of the total population. Thickness values greater than 23.3 mm are considered Class 2, having class label '1'. Class 2 has 4,231 instances, accounting for 50.37% of the total population. The thickness histogram (in percentage frequency) of this total dataset is shown in Fig. 1.
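The labelling rule can be sketched as follows; the 23.3 mm cut-off is the paper's, while the thickness values in the example list are invented for illustration.

```python
# Class labelling: thickness <= 23.3 mm -> Class 1 (label 0),
# thickness > 23.3 mm -> Class 2 (label 1).
CUTOFF_MM = 23.3

def thickness_label(thickness_mm):
    return 0 if thickness_mm <= CUTOFF_MM else 1

sample_thicknesses = [18.0, 23.3, 23.4, 30.1, 21.7]   # made-up values
sample_labels = [thickness_label(t) for t in sample_thicknesses]
```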
The input set $X$ has two real-valued feature vectors. That means the domain of $X$ becomes $X \in \mathbb{R}^{h \times 2}$, corresponding to the vector of labels $Y \in \mathbb{B}^{h}$. These features have been extracted from time domain PEC signals [8], [9], [10]. Shown in Fig. 2 is a scatter of the two features corresponding to all 8,400 measurements (or instances) in the total dataset. The two classes (i.e., Class 1 and Class 2) are colour coded in Fig. 2.

B. Splitting Training and Testing Sets
The authors assumed that only 30% of the total dataset would be available for training. To start, the authors performed random nonstratified partitioning of the 8,400 measurements into a 30:70 split, 100 times. This yields 100 subsets containing 30% of the total dataset, and 100 corresponding subsets containing the remaining 70%. This 100-fold splitting provides 100 trials with which to assess classifier performance. The intention was to test the 100 trials separately, i.e., to learn 100 classifiers and perform 100 corresponding validations. If the 100 classifiers and the accuracies of the 100 corresponding validations statistically exhibit some convergence, that would indicate the success (or failure) of the work of this paper.
As an example, Fig. 3 shows the thickness histogram (in percentage frequency) of the training set (i.e., the 30% subset) of the 100th trial (or partitioning). This dataset has 2,520 instances, and its histogram is comparable to the thickness histogram of the total population (i.e., 100% of the data) shown in Fig. 1. Having comparable distributions is expected for model training/testing exercises, and it appeared that, with a total of 8,400 measurements (or instances) available, random nonstratified partitions of 30% would usually have distributions comparable with the total population. This observation was common among all the obtained partitions.

Fig. 3. Training data (i.e., 2,520 instances), thickness histogram of the 100th trial, in percentage frequency.

Fig. 4 is a scatter of the two features corresponding to the 2,520 measurements (or instances) in the training dataset of the 100th trial (or partitioning). The two classes (i.e., Class 1 and Class 2) are colour coded in Fig. 4. The distribution of instances in Fig. 4 is comparable to that of Fig. 2, indicating that the training dataset has a distribution comparable with the total population. This observation was common among all the obtained partitions. Further, the training dataset in Fig. 4 had 1,273 instances (i.e., 50.52%) in Class 1 and 1,247 instances (i.e., 49.48%) in Class 2, indicating that the training sample is in accordance with Assumption 2 (i.e., the training dataset having approximately a 50:50 split between the two classes). This compliance with Assumption 2 was also common among all the obtained partitions. In all 100 trials, 100% of the data (i.e., the total population) was used as the testing dataset. Assessing classifier performance in that manner, on a single large dataset, makes the performance of each classifier statistically comparable.
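One of the 100 random nonstratified 30:70 partitions could be produced along these lines; this is a sketch using Python's `random.sample`, with a seed of our choosing for reproducibility only. Note that the paper ultimately evaluates every classifier on 100% of the data rather than on the 70% remainder.

```python
import random

# Draw a random nonstratified 30:70 split of instance indices.
def split_30_70(num_instances, seed):
    rng = random.Random(seed)
    k = round(0.30 * num_instances)                   # 30% for training
    train_idx = set(rng.sample(range(num_instances), k))
    test_idx = [i for i in range(num_instances) if i not in train_idx]
    return sorted(train_idx), test_idx

train_idx, test_idx = split_30_70(8400, seed=1)       # 2,520 / 5,880 instances
```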

C. Benchmarking Classification Accuracy with Features' Original Form
SVM is used for classification. The objective is to first benchmark the performance of an SVM classifier over the 100 trials (described in subsection IV-B) with the features in their original decimal form (i.e., prior to performing Binary Spectrum transformation). The subsequent intention is to assess the performance of an SVM classifier trained by Binary Spectrum features over the same 100 trials. In this subsection, we present the former analysis, i.e., benchmarking performance of an SVM classifier trained with the original form of the features over the 100 trials.
Given the non-linear nature of the data, the Gaussian kernel was chosen for SVM classification. The commonly known hyper-parameters named the Box Constraint and the Kernel Scale were set to be optimized during training (done with the 30% splits explained in subsection IV-B). The $o(X)$ features (or predictor variables) were standardized before being fed to the classifier. On every trial, the initial value given to both parameters (i.e., Box Constraint and Kernel Scale) was 1. The optimized classifier resulting from every trial was then evaluated with the testing data (i.e., 100% of the data, or the total dataset, as mentioned in subsection IV-B). The $acc$ value (recall (7)) for each of the 100 trials was recorded to serve as the metric for performance evaluation. Depicted by the broken black line in Fig. 5 are the $acc$ values resulting from the 100 trials of training a classifier with the raw values of the features.

D. Evaluating Classification Accuracy with Binary Spectrum Features
The $accb$ value (recall (23)) of the best performing classifier (i.e., $C_{n^{*}}$ identified from Algorithm 1) for each of the 100 trials (the same ones benchmarked in subsection IV-C) was recorded. These $accb$ values serve as the metric for performance evaluation of the Binary Spectrum feature. Depicted by the solid black line in Fig. 5 are the $accb$ values recorded from the corresponding 100 trials. For the preliminary work reported in this paper, $n_{max}$ (recall Algorithm 1) was set to 10; greater $n_{max}$ values can be evaluated as well. The selection of the SVM kernel, the hyper-parameter initialization, and the training procedure (now considering the Binary Spectrum feature) were identical to those described in subsection IV-C.
As evident from Fig. 5, it was possible to achieve a superior performing classifier on every single trial by using the Binary Spectrum feature. On average, an improvement of 1.46% in classification accuracy (calculated as the average of $(accb - acc)/acc \times 100\%$) was observed across the 100 trials. The maximum improvement in classification accuracy was 3.49%, in the 59th trial. The minimum observed improvement was 0.1%, in the 9th trial.
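The improvement figure reported above is the trial-wise relative gain averaged over the trials; a small sketch of that computation, with invented accuracy values:

```python
# Average of (accb - acc) / acc * 100% across trials.
def mean_relative_improvement(acc_values, accb_values):
    gains = [(b - a) / a * 100.0 for a, b in zip(acc_values, accb_values)]
    return sum(gains) / len(gains)

# Two invented trials: relative gains of 2.0% and 1.0%, averaging 1.5%
mri = mean_relative_improvement([90.0, 92.0], [91.8, 92.92])
```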
Depicted in Fig. 6 is the variation of $accb$ for all 100 trials across the 10 bits (i.e., $n_{max} = 10$ for the work reported in this paper). There are 100 line graphs in Fig. 6, spanning $n = 1$ through 10. Each line graph corresponds to a trial and illustrates the variation of that trial's $accb$. It can be observed that there is a clear trend of reducing accuracy past around the 5-to-7 bit mark. Whether this downward trend persists for greater numbers of bits was not investigated at this stage. What we intend to report as a finding is the fact that it is possible to increase the classification accuracy of a given supervised learned classifier with a given fixed set of continuous-valued features, by using the proposed Binary Spectrum feature (or transformation).

V. CONCLUSIONS

The case of improving classification accuracy for a given supervised learning classification task that has to be performed with a given reduced set of continuous-valued features was considered. The case imposes that further reduction of features or addition of new features is not possible; all the provided features have to be used. It was shown, via a demonstrative binary classification (i.e., two-class classification) example, that it is possible to increase classification accuracy within the considered premise, via a feature transformation that produces a novel feature set the authors have named the "Binary Spectrum feature". An increase in classification accuracy of about 1.46% ($\approx 1.5\%$) was observed for the considered example following the Binary Spectrum transformation. The derivation of the Binary Spectrum feature was presented in detail, along with a preliminary algorithm to identify best performing classifiers. The findings indicate potential for broader usage of the Binary Spectrum feature and may provoke interest in further investigation.
Limitations of this study include the following: (1) only a binary classification task was examined (i.e., multi-class classification was not examined); (2) the descriptor variables were imposed to be continuous-valued (i.e., the more general case of both continuous-valued and categorical descriptor variables being present was not considered); (3) class populations were imposed to be even (i.e., the case of uneven class populations was not examined); and (4) the feature set of the case study included only two feature vectors (i.e., a higher dimensional example was not examined). As such, future work can investigate relaxing some of the assumptions imposed on this work. Performance evaluation of the Binary Spectrum transformation on more sophisticated classification tasks involving multiple classes and higher dimensional data also remains unexplored.