One-dimensional DCNN Feature Selective Transformation with LSTM-RDN for Image Classification

—Feature selection and transformation are important techniques in the machine learning field. A good feature selection or transformation can greatly improve the performance of a classification method. In this work, we propose a simple but efficient image classification method based on a two-stage processing strategy. In the first stage, one-dimensional features are obtained from an image by transfer learning with a pre-trained Deep Convolutional Neural Network (DCNN). These one-dimensional DCNN features still have the shortcomings of information redundancy and weak distinguishing ability, so it is necessary to apply a feature transformation to obtain more discriminative features. We propose a feature learning and selective transformation network based on Long Short-Term Memory (LSTM) combined with ReLU and Dropout layers (called LSTM-RDN) to further process the one-dimensional DCNN features. Verification experiments were conducted on three public object image datasets (Cifar10, Cifar100 and Fashion-MNIST), three fine-grained image datasets (CUB200-2011, Stanford-Cars and FGVC-Aircraft) and a COVID-19 dataset, with several backbone network models, including AlexNet, VGG16, ResNet18, ResNet101, InceptionV2 and EfficientNet-b0. Experimental results show that the recognition performance of the proposed method significantly exceeds that of existing state-of-the-art methods. The level of machine vision classification has reached a bottleneck, and it is difficult to break through it with large-scale network models whose huge numbers of parameters must be optimized. We present an effective approach for breaking through the bottleneck of the visual classification task by performing feature extraction with a backbone DCNN and feature selective transformation with LSTM-RDN separately. The code and pre-trained models are available from: https://github


I. INTRODUCTION
Machine vision is becoming more and more important in our life and work, which in turn promotes its development, and our requirements for its technical level keep rising. For example, we require computers to accurately identify animal species, accurately locate moving targets, and accurately predict the type of pneumonia from CT images. Machine vision has three major tasks: image classification, object detection, and image segmentation. Image classification is the basis of the three. Its importance is self-evident, but current image classification technology is still at a relatively low level and cannot meet the requirements of our life and work.
In machine vision, feature extraction from the image is a necessary step. Filters and descriptors [1] are two basic techniques for extracting image features. The former mainly apply linear transformations to pixels in a local area of the image to obtain more prominent image information while removing image noise; these filters include DoG [2], Wavelet [3], etc. Descriptors such as LBP [4] and SIFT [5] mainly perform more complex calculations on the pixels within a local area of the image, such as computing local geometric or statistical information. Due to the complexity of images, the local features yielded by the aforementioned handcrafted descriptors and filters are mainly determined by the pixels in the local area, and their performance is very limited. In recent years, researchers' interest in the machine learning field has shifted from handcrafted descriptors and filters to learning-based ones. The latter introduce a learning mechanism that integrates the local features obtained from local regions into global features, and at the same time adjusts the parameters of the filters or descriptors through training feedback to obtain better representation performance. Dictionary learning [6], [7], Bag of Visual Words (BoVW) [8] and neural network learning [9] are three types of learning-based descriptors.
Filters and descriptors focus on the local features of the image and then use statistical techniques such as histograms to obtain overall features, while methods based on Deep Convolutional Neural Networks (DCNN) extract overall features directly from the whole image. The advantage of this direct feature extraction is that it enables an end-to-end learning process and makes it more convenient to obtain high-level image semantics. AlexNet [10] is a classic shallow, single-branch CNN; it outperformed traditional machine learning methods and won the championship on the ImageNet dataset in 2012. With the development of the times, such shallow, single-branch networks are increasingly unable to meet the challenge of more complex applications. Since then, various effective network models have been designed to overcome the difficulties encountered in image classification. At present, network structures are developing toward more branches (horizontal) and deeper layers (vertical). The vertical development focuses on using deeper layers to obtain better results: the original AlexNet had only some twenty layers, later networks such as the VGG models [11] have dozens of layers, and current networks such as NasNet [12] reach thousands of layers. The horizontal development is not as fast as the vertical one; the main strategy is to add a limited number of branches at certain layers of the network to extract a wider range of image features. Models with more branches include the residual structures of ResNet [13] and WideResNet [14] and the branch structure of the Inception networks [15], [16], [17].
DCNN image classification performance has reached a bottleneck, and it is very difficult to significantly improve recognition performance further. Recently, some new visual models have appeared, such as Transformer and multi-layer perceptron (MLP) models. The Transformer [18], originally used in natural language processing, has been introduced into machine vision and received great attention. ResMLP [19] and MLP-Mixer [20] are two simple MLP-based networks that achieve unexpectedly high performance on ImageNet classification benchmarks; their architectures are built entirely upon MLPs embedded into a residual network.
Super-deep and multi-branched DCNN structures make the model more demanding on machine hardware. It usually takes several days or even dozens of weeks to train a large model because of the huge number of network parameters, yet the performance improvement is insignificant, at least not meeting our expectations. Is this kind of super-deep structure necessary? Researchers are exploring Transformers, attention-based models [21], MLP-based networks, or combinations of these methods to avoid the disadvantages of super-deep network models. Different from these methods, we use a two-stage processing strategy to overcome the problem of huge parameters and thereby improve the performance of the network model. The proposed method only needs the one-dimensional features of a DCNN such as ResNet18; it then applies feature selective transformation and fine-tunes the model parameters to obtain promising results, which are much higher than the recognition results of other super-deep DCNN and Transformer models.
Our experiments show that, in most cases, there is no significant performance gap between a shallow neural network and a super-deep neural network after feature selective transformation, which is one of the main contributions of this article. In addition, because there is correlation between DCNN features, we regard the one-dimensional DCNN output features as vector sequences and use Long Short-Term Memory (LSTM) to perform feature transformation so that the features acquire better distinguishing capabilities; at the same time, we add ReLU and Dropout layers to the network to perform feature selective transformation and improve generalization ability.

II. RELATED WORK
A typical classification task of machine vision is to predict labels from sample images. The first step of a classification task is feature extraction, which converts the sample image into a set of features with obvious physical meaning, such as geometric and texture features, or statistical significance (see Fig.1). If the samples have few features, we consider adding more. In reality, there are often too many features, and some of them need to be removed; therefore, we often use feature selection or feature transformation to refine the original features. Feature transformation and selection aim to find the most effective features (invariance within a class, discrimination between classes, robustness to noise) from the original ones. They not only alleviate overfitting, reduce runtime and improve the generalization ability of the model, but also give the model better interpretability and enhance the understanding of image content. Typical feature extraction algorithms include the Gabor [22], HOG [23] and LBP descriptors, used to obtain low-level image features, and MLP [24] and DCNN, used to extract high-level features. Feature transformation methods are relatively extensive and involve most machine learning and pattern recognition approaches. Some feature extraction methods, like MLP and DCNN, can also be used for feature transformation; in addition, Recurrent Neural Networks (RNN) [25] and Dictionary Learning (DL) [26] are commonly used for feature transformation. Feature selection refers to selecting N significant features from the existing M features to optimize specific indicators of the system. Principal Component Analysis (PCA) [27], Linear Discriminant Analysis (LDA) [28] and Deep Forest (DF) [29] are effective methods for feature selection.
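As a small illustration of feature transformation with built-in selection, the following NumPy sketch projects a set of feature vectors onto their top-k principal components (PCA via SVD). The random feature matrix is purely illustrative and not part of the proposed method.

```python
import numpy as np

def pca_transform(X, k):
    """Project feature vectors onto the top-k principal components.

    X: (n_samples, n_features) matrix of feature vectors.
    Returns the (n_samples, k) transformed features.
    """
    Xc = X - X.mean(axis=0)                          # center the data
    # SVD of the centered data; rows of Vt are the principal axes,
    # ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                             # keep the k strongest directions

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))    # illustrative stand-in for DCNN feature vectors
Z = pca_transform(X, 4)           # transformed features, 16 -> 4 dimensions
```

Dropping all but the top-k components is simultaneously a transformation (rotation onto new axes) and a selection (discarding low-variance directions), matching the distinction drawn above.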
It should be emphasized that the boundary between the three technologies (feature extraction, transformation and selection) is not clear-cut. In many cases the same model (such as a DCNN) can be used for both feature extraction and feature transformation; PCA and LDA also perform feature selection while carrying out feature transformation. For the sake of distinction in this work, a method that finds the optimal features by applying feature transformation and selection at the same time is called feature selective transformation.
In recent years, LSTM networks have been applied to image classification and other processing tasks. An LSTM is a recurrent neural network specially designed to solve the long-term dependency problem of general RNNs (Recurrent Neural Networks). An RNN has a chain of repeated neural network modules. In a standard RNN this repeated module is very simple, such as a Tanh or Linear layer, while the repeating module of an LSTM is more complex and much more effective. The basic cell of an LSTM is shown in Fig.2. Since an LSTM is a kind of RNN, its output is affected by the current input $x_t$, the previous hidden state $h_{t-1}$, and the previous cell state $C_{t-1}$. Overall, an LSTM consists of three gates (forget gate, input gate and output gate) which combine the inputs and control the output. The forget gate is expressed as

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f). \quad (1)$$

There are two important parts in the input gate which update the cell state, denoted as

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \qquad \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C). \quad (2)$$
The new cell state $C_t$ is expressed as

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t. \quad (3)$$

The final output $h_t$ (or $y_t$) is the result of multiplying the output $o_t$ of the output gate by the squashed current cell state $\tanh(C_t)$:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \qquad h_t = o_t \odot \tanh(C_t). \quad (4)$$

In image analysis, although DCNNs have achieved great success, their performance still struggles to meet the requirements of practical applications. LSTM is an effective tool for making up for the shortcomings of DCNNs, and there have recently been many examples of DCNNs combined with LSTMs. For example, [30] proposes handwriting recognition based on a Convolutional-LSTM network; [31] proposes a tree-structured regional CNN-LSTM model for dimensional sentiment analysis. In these hybrids, the LSTM is mostly placed at the end of the network or used as an intermediate processing unit.
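The standard LSTM gate computations described above can be sketched as one NumPy step; the weight matrices below are random and purely illustrative, not a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, C_prev, W, b):
    """One LSTM step: forget gate, input gate, cell update, output gate.

    Each W[k] maps the concatenated [h_prev, x_t] to a gate pre-activation.
    """
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W['f'] @ z + b['f'])       # forget gate
    i_t = sigmoid(W['i'] @ z + b['i'])       # input gate
    C_tilde = np.tanh(W['C'] @ z + b['C'])   # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde       # new cell state
    o_t = sigmoid(W['o'] @ z + b['o'])       # output gate
    h_t = o_t * np.tanh(C_t)                 # hidden output
    return h_t, C_t

# toy dimensions: 3-dim input, 2-dim hidden state; random illustrative weights
rng = np.random.default_rng(1)
n_in, n_h = 3, 2
W = {k: rng.normal(size=(n_h, n_h + n_in)) for k in 'fiCo'}
b = {k: np.zeros(n_h) for k in 'fiCo'}
h, C = lstm_cell(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), W, b)
```

Since the output gate lies in (0, 1) and tanh in (-1, 1), every component of the hidden output is strictly inside (-1, 1).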

III. LSTM-BASED FEATURE SELECTIVE TRANSFORMATION NETWORK
The process of the proposed image recognition method, based on a two-stage strategy with feature selective transformation, is shown in Fig.3. The method uses a pre-trained network model (such as ResNet) as the backbone, and then trains it on a specific image dataset by transfer learning to obtain the specified network model. Using the specified network model, we obtain two new one-dimensional DCNN feature datasets: a train set and a test set. Finally, we use the feature selective transformation network based on three types of components, LSTM, ReLU and Dropout layers (called LSTM-RDN), to train on the new train set and classify the images of the new test set.
LSTM-RDN roughly consists of two parts, as shown in Fig.4. The first part is the feature transformation component, an LSTM layer. We regard the one-dimensional DCNN features as a sequence with strong correlation between the features; this correlation can be memorized and utilized by the LSTM to optimize the transformation weights of the current input features. The second part is a feature selection component composed of a Rectified Linear Unit (ReLU) layer and several Dropout layers; it uses ReLU to perform a nonlinear mapping on the output of the LSTM layer, and then uses three Dropout layers to discard some features.
Both parts of the proposed network are mainly intended to improve the discriminability of the DCNN features as much as possible. The following experiments show that if only the LSTM layer is used without the feature selection part (ReLU and Dropout layers), the model's performance is greatly reduced (see the experiments and discussion for details). At the front of LSTM-RDN is the sequence layer, which organizes the serialized data, followed by a Dropout layer (named D0) which randomly discards some features before the DCNN features are fed to the LSTM layer.
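To make the described data flow concrete, here is a minimal NumPy sketch of one forward pass through the stated layer order (D0, LSTM, ReLU, D1-D3, classifier). The LSTM layer is abbreviated to a single tanh transformation, and all layer sizes, weights and dropout probabilities are illustrative assumptions, not the trained LSTM-RDN.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p, train=True):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    if not train or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def lstm_rdn_forward(feat, W_lstm, W_fc, p0=0.5, p1=0.5, p2=0.5, p3=0.5, train=True):
    """Forward pass mirroring the layer order D0 -> LSTM -> ReLU -> D1-D3 -> classifier.

    The LSTM layer is stood in by one tanh transformation for brevity.
    """
    x = dropout(feat, p0, train)     # D0: drop some input DCNN features
    h = np.tanh(W_lstm @ x)          # stand-in for the LSTM transformation
    h = np.maximum(h, 0.0)           # ReLU: zero out negative responses
    for p in (p1, p2, p3):           # D1, D2, D3: further feature selection
        h = dropout(h, p, train)
    return W_fc @ h                  # final classification logits

feat = rng.normal(size=512)                       # e.g. a 512-dim ResNet18 feature vector
W_lstm = rng.normal(size=(128, 512)) * 0.05       # illustrative transformation weights
W_fc = rng.normal(size=(10, 128)) * 0.1           # illustrative 10-class classifier
logits = lstm_rdn_forward(feat, W_lstm, W_fc, train=False)
```

At inference (`train=False`) the dropout layers become identity maps, so only the transformation, ReLU and classifier act on the features.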
The only parameter of a Dropout layer is the drop probability, i.e., the probability of dropping input elements, specified as a numeric scalar in the range [0,1]; it plays a very important role. In LSTM-RDN, we set up a total of three Dropout layers before and after the LSTM layer. Theoretically, D1 and D2 can be combined into one Dropout layer: supposing the dropping probabilities of D1 and D2 are $p_1$ and $p_2$, respectively, their combined dropping probability $p$ is

$$p = 1 - (1 - p_1)(1 - p_2). \quad (5)$$

Because the drop probability is a very critical parameter in LSTM-RDN, subtle changes to it may cause large fluctuations in the recognition performance of the model. Therefore, we split one dropout layer into several for the purpose of fine-tuning this parameter; similarly, we can divide dropout D0 into multiple dropouts using EQ (5). LSTM feature transformation has a similar function to PCA, LDA, dictionary learning and fully connected networks, all of which can transform DCNN feature vectors into new feature vectors with more distinguishing power. PCA/LDA needs to compute the scatter matrix of all the DCNN feature vectors in the training set at one time, but its performance is limited by the lack of a learning mechanism. A fully connected network / dictionary learning obtains the transform model / dictionary by feeding the samples to the model one by one or batch by batch, but does not consider the correlation between the feature sequences. The characteristic of LSTM feature transformation is that the LSTM processes the one-dimensional DCNN features as a time series, so it inputs these features one by one to train the model (see Fig.5), and at the same time computes the output from the current and previous states of the model.
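Under the usual assumption of independent dropout masks, the combination rule for stacked dropout layers can be checked numerically; the helper name and probabilities below are illustrative only.

```python
import math

def combined_drop_prob(p1, p2):
    """A feature survives both layers with probability (1-p1)(1-p2),
    so the combined dropping probability is p = 1 - (1-p1)(1-p2)."""
    return 1.0 - (1.0 - p1) * (1.0 - p2)

# two stacked p=0.5 layers act like a single p=0.75 layer
p_both = combined_drop_prob(0.5, 0.5)

# conversely, one p=0.5 layer can be split into two equal layers with
# p1 = p2 = 1 - sqrt(0.5) ~ 0.2929, allowing finer-grained tuning
p_half = 1.0 - math.sqrt(0.5)
```

Splitting one layer into several with smaller probabilities leaves the combined drop rate unchanged while exposing more knobs for fine-tuning.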
The schematic diagram of the feature selection part composed of a ReLU and three Dropout layers is given in Fig.6.
ReLU is a commonly used activation function in artificial neural networks. The reason ReLU is so useful is that it sets the output of some neurons to zero, which leads to sparsity of the features (or of the network). That is, it lets each neuron perform a screening function: if the input of the neuron is greater than zero it is passed through; otherwise it is cut off, denoted by

$$\mathrm{ReLU}(z_i) = \max(0, z_i), \quad (6)$$

where $z_i$ is the input of a neuron in the ReLU layer. ReLU and Dropout implement similar functions in different ways: Dropout randomly removes some nodes (or weight parameters) from network training so as to obtain more robust new features. Since both ReLU and Dropout have the function of zeroing or discarding certain features, we call their joint function here feature selective transformation. The popular "attention" mechanism uses weights to amplify or weaken the values of certain input variables or blocks of a feature map [32]; our proposed LSTM-RDN has a similar selective function to the attention mechanism.
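A tiny numerical example of the screening effect of ReLU (toy values, not taken from the model):

```python
import numpy as np

# a toy vector of neuron pre-activations z_i
z = np.array([-1.2, 0.3, -0.5, 2.0, 0.0, -3.1, 1.1, -0.2])

# positive inputs pass through unchanged, the rest are zeroed
relu = np.maximum(z, 0.0)

# 5 of the 8 responses are silenced, leaving a sparse feature vector
```

Each negative or zero pre-activation is discarded, which is exactly the feature-screening behavior described above.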

IV. EXPERIMENTS
The experiments were conducted on two types of image datasets and a COVID-19 dataset. The first type is object datasets, including Cifar10, Cifar100 and Fashion-MNIST. The second type is fine-grained image datasets, including CUB200-2011, Stanford-Cars and FGVC-Aircraft; sample images are shown in Fig.7 and Fig.8. The backbone network models we used are AlexNet, VGG16, ResNet18, ResNet101, InceptionV2 and EfficientNet-b0. These pre-trained models are all trained on ImageNet. After downloading these models, we train them with transfer learning on the above datasets, respectively, to obtain the 1D DCNN features of the images. For object and fine-grained image recognition, in the feature selective transformation phase, the Root Mean Square Prop (RMSProp) optimization algorithm is used to train LSTM-RDN; the initial learning rate is 0.001, and the maximum number of initial iterations is set to 2000 (if the mini-batch loss does not fall below 0.01 or the mini-batch recognition accuracy stays below 100%, we increase the number of training iterations appropriately). Under normal circumstances, the dropout probability of the several dropout layers is set to 0.5; when the number of training samples is small or the feature dimension is large (such as the 4096-dimensional features of VGG16), the dropout probability is increased to 0.6-0.85.
For different datasets and backbone networks, we use a fine-tuning approach to train the LSTM-RDN model in the experiments. Due to randomness in the training process of LSTM-RDN, the results of repeated tests may differ; the classification accuracies reported here are the relatively stable results over several tests. Our experiments were performed on a computer with an i9 CPU and a 2080ti GPU.

A. Object Classification
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class; there are 50000 training images and 10000 test images. Cifar100 has 100 classes containing 600 images each, with 500 training images and 100 test images per class. The Fashion-MNIST dataset contains 10 categories of images; the training set contains 6000 samples per category and the test set 1000 samples per category. After the pre-trained model undergoes transfer learning on the target dataset, we compute the recognition accuracies of the backbones using the transfer-learned models, which classify the images with a Softmax layer, and the recognition accuracies of our proposed LSTM-RDN based on the 1D features of the transfer-learned models. The recognition accuracies are listed in Table I. On the three datasets of Cifar10, Cifar100 and Fashion-MNIST, the recognition accuracies of our method are all higher than 98%, and the highest is 99.98% on Fashion-MNIST. The performance of the pre-trained models is significantly improved after using LSTM-RDN, and the recognition accuracies on Cifar10 and Fashion-MNIST are improved by 5 to 14 percentage points (see the rightmost column of Table I). On Fashion-MNIST, the recognition accuracies of these models exceed 99%. Especially on Cifar100, the recognition accuracies of the transfer-learned models are only about 60%, and LSTM-RDN can raise the backbones' recognition accuracies to more than 95%, an improvement of more than 30 percentage points. To illustrate that LSTM-RDN is effective and advanced, we compare it with advanced published methods; the comparisons are listed in Table II - Table IV. Currently, the best records on the three datasets are 99.70% (Cifar10), 96.08% (Cifar100) and 96.91% (Fashion-MNIST), while our highest results are 99.94% (Cifar10), 98.21% (Cifar100) and 99.98% (Fashion-MNIST).
It is obvious that the recognition performance of LSTM-RDN based on pre-trained backbones is significantly higher than the best record.

B. Fine-Grained Image Classification
Fine-grained image recognition makes a more refined distinction between target image categories: for example, not only identifying that an image shows a dog, but also telling whether it is a Labrador or a Husky. The task of fine-grained image recognition was proposed several years ago, but it has not been well solved; on the contrary, results are still far from our expectations.
The CUB200-2011 dataset has 11788 bird images in 200 bird sub-categories; the training set has 5994 images and the test set has 5794 images. The Stanford Cars dataset is an image dataset containing 196 types of cars, with 8144 training images and 8041 test images. The FGVC-Aircraft dataset contains 10000 aircraft images in 100 variant categories. The recognition results are listed in Table V. It can be seen that the classification ability of the pre-trained models on these three fine-grained datasets is relatively poor, with recognition accuracies roughly between 50% and 70%. However, after feature selective transformation with LSTM-RDN, their recognition accuracies soar to more than 90%, an increase of 30 to 50 percentage points. The results show that LSTM-RDN is very effective. On CUB200-2011, the current best recognition accuracy is 90%; our proposed method based on the ResNet models (ResNet18 and ResNet101) exceeds 98%, while even the lowest recognition accuracy with LSTM-RDN, 92.30% (AlexNet), is more than 2 percentage points higher than the best existing method. On the Stanford Cars and FGVC-Aircraft datasets, the highest published recognition accuracies are 96.2% with DAT (AmoebaNet-B) and 94.70% with MMAL-Net (ResNet-50), respectively, while the best recognition accuracies of LSTM-RDN (ResNet18) are 98.81% and 95.64%, respectively. The specific results are shown in Table VII and Table VIII.

C. COVID-19 prediction
We train a binary classification model with transfer learning to predict whether a CT image is COVID or non-COVID. The evaluation is performed on the COVID CT dataset [51], an internet-based dataset containing CT images from patients: 349 CT images positive for COVID-19 belonging to 216 patients, and 397 CT images negative for COVID-19. It is divided into train, validation and test sets (see Table IX). The train and validation sets are used for model learning; the test set is used to measure the recognition accuracy of the models. In order to obtain good performance, for the one-dimensional features generated by different backbone models, we use different training parameters, including loss functions, learning rates and dropout probabilities, to fine-tune the LSTM-RDN model.
In the experiments, the accuracy, precision, F1-score and recall evaluation measures were used. The calculation of these measures is based on the confusion matrix; the confusion matrix for binary classification is shown in Fig.9. Fig.10 shows samples selected from the COVID CT dataset: the first two rows show COVID CT images, and the last two rows non-COVID CT images. Table X shows that the recognition accuracy of VGG16 is 73.39%, while the performance is improved to 95.46% by feature selective transformation with LSTM-RDN. ResNet18 achieves an accuracy of 95.51%, which is very close to VGG16's recognition performance. ResNet101, by contrast, performs quite well on this dataset, with a recognition rate of 99.01%. In comparison, the accuracy of DenseNet169 with mask [51], which has more parameters and more layers, is 87.1%; TL+CSS [51] obtains 89.1% accuracy; the Multi-Task deep model [53] reaches 86.0%; and the advanced method HFSM [52], which also uses feature selection to improve performance, achieves 93.0% accuracy. It can be seen that our method achieves good performance on all three backbone models (VGG16, ResNet18 and ResNet101). On the other measures, DenseNet169 (mask) and TL+CSS achieve 88.1% and 89.6% F1-score, respectively; our method significantly surpasses both, reaching a maximum of 98.97%, an increase of about 10 percentage points. Although HFSM performs well in accuracy, it is relatively poor in precision and recall, reaching only 72% and 71%, respectively. LSTM-RDN (ResNet101) shows excellent performance: its precision is 100%, and its recall reaches 97.96%. LSTM-RDN (ResNet18) and LSTM-RDN (VGG16) are also significantly better than the other three methods.
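For reference, the four measures can be computed from the confusion-matrix counts as follows; the counts in the example are illustrative only, not the paper's results.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1-score from a binary confusion matrix.

    tp/fp/fn/tn: true-positive, false-positive, false-negative, true-negative counts.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# illustrative counts for a 100-image test set
acc, prec, rec, f1 = binary_metrics(tp=48, fp=2, fn=1, tn=49)
```

High accuracy with low precision or recall (as reported for HFSM) is possible because accuracy averages over both classes while precision and recall look only at the positive class.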
Due to the randomness of LSTM-RDN, the results of each trained model may change slightly. We conducted 10 rounds of training and testing, and the obtained test accuracies fluctuate slightly (see Fig.12), with a range of about 1.5%. In addition to the use of Dropout, the fluctuations in recognition accuracy may be caused by the noise in CT images and the small number of samples.

V. DISCUSSION AND ABLATION
From Table II and Table VI, it can be seen that the number of training samples in the fine-grained datasets is far less than that in ordinary image datasets; this is the main cause of the performance degradation of super-deep models. LSTM-RDN contains two important parts, an LSTM layer and several ReLU and Dropout layers, and both play an important role. The LSTM can raise the recognition accuracy of AlexNet/ResNet18 to more than 80%/90% on the FGVC-Aircraft and Stanford Cars datasets, thanks to its unique memory structure. Next, we need to know how effective the Dropout layers are in LSTM-RDN. Fig.13 and Fig.14 show the recognition performance of the LSTM combined with different numbers of Dropout layers for two backbones (AlexNet and ResNet18) on two datasets. When the feature dimension is large, more Dropout layers are needed, which also requires more training time. From the experiments we conclude that, considering both recognition accuracy and training time, two Dropout layers are almost optimal.
It is natural to think of using LSTM-RDN to combine the features of multiple models to further improve performance. We evaluate two combinations, ResNet18+AlexNet and ResNet18+ResNet101, on Cifar100 (see Table XII). The experiment shows that, compared with the features of a single DCNN, the feature combination of two DCNNs only slightly improves the recognition accuracy; the advantage of the combination is that we can obtain more robust performance than with a single model.

VI. CONCLUSIONS
At present, a large number of image classification models have appeared and achieved unprecedented performance improvements; however, they still have not broken through the bottleneck of classification tasks. A parameter scale of hundreds of millions is the main factor restricting performance improvement: with mainstream optimization techniques such as gradient descent, it is difficult to reach optimal model parameters even using millions of samples and thousands of training epochs. A multi-stage processing strategy can reduce the workload of parameter optimization exponentially, which is an effective way to overcome the problem of huge parameters. We have raised the classification accuracy to a new level on several image databases through a two-stage strategy (feature extraction and feature selective transformation); experiments show that the accuracy of the backbone network can be increased by more than 30 percentage points. Based on this multi-stage strategy, we proposed LSTM-RDN, which can improve the performance of pre-trained network models such as AlexNet and ResNet18 to a very satisfactory level. We also come to the conclusion that the performance of a shallow network can even exceed that of a super-deep network after feature selective transformation. LSTM-RDN's performance may still have room for improvement, so in future work we will continue to study feature selective transformation, such as exploring the influence of the feature dimension, the number of LSTM layers, and the number of hidden units of LSTM-RDN.