Similarity-Based Clustering for Enhancing Image Classification Architectures

Convolutional networks are at the center of state-of-the-art computer vision systems for a wide assortment of tasks. Since 2014, a substantial body of work has gone into designing better convolutional architectures, yielding considerable gains on various benchmarks. Although increased model size and computational cost tend to translate into immediate quality gains for most tasks, architectures now need additional information about the data to improve further. I show evidence that by combining content-based image similarity with deep learning models, we can provide a flow of information that makes clustered learning possible. The paper shows how training on sub-dataset clusters not only reduces the cost of computation but also increases the speed of evaluating and tuning a model on a given dataset.


Introduction
Understanding the world in a single glance is one of the most practiced accomplishments of the human mind. It takes only milliseconds to recognize the category of an object or an action, underlining the significant role of feedforward processing in visual analysis. With the introduction of deep convolutional neural networks [1], which achieved breakthrough accuracies in the domain of image classification, there has been a profuse amount of work on developing better architectures and training methodologies [2], [3]. The current work in image classification focuses mostly on the networks and much less on understanding the dataset and its structure.
For instance, we saw work on depthwise separable network architectures [4], [5] and dynamic model scaling [6], along with methodologies like Noisy Student semi-supervised learning [7], which brought higher accuracy on ImageNet while providing an intuition into self-learning and distillation. Although architectures like these have produced astonishing results on many large databases such as ImageNet [8], they still need to be tweaked and trained in particular ways to get the best out of object detection [9] or fine-grained datasets [10]. We have seen significant improvements in accuracy from tuning models based on knowledge acquired by studying the structure of the dataset.
So what if we append that type of information to the model itself?
The answer comes from a field that has not yet been amalgamated with modern deep learning architectures: image similarity [11]. Since the foundation of the field, there has been a great deal of work on understanding how machines see an image. For example, when image fusion evolved around the idea of making the output image convey the scene more elaborately than any of the input images, it was an earnest step toward the development of image retrieval systems [12], [13], [14], [15]. By combining deep learning with content-based image similarity, it became easier to develop accurate analysis models. We show how, by amalgamating these two fields, we achieve even better architectures. I argue for an additional meta-data field, which I call the similarity matrix. It can save hours of computational time and also increase the accuracy of even the deepest architectures on class-similar datasets. When working with datasets that contain a hodgepodge of similar and distinctive classes, for example Stanford's dog breeds dataset [10], it is difficult to tune the model efficiently for better performance. The problem is not only the drop in accuracy but also the overfitting of the model. Standard techniques can only mitigate overfitting up to a certain limit [16].

Clustered training
What I propose is a clustering of the dataset into two or more sub-datasets, with sequential training, to improve existing architectures. The clusters are formed based on the similarity matrix of all the classes in the dataset, so the most similar classes end up together in one cluster. The main reason for grouping similar classes together is to obtain a small number of acceptable clusters with enough intercluster distance for the master classifier to accurately select the appropriate cluster for a new image at inference.
Many would argue that, while clustering, it is better to group dissimilar classes to ensure even better accuracy. This idea does make sense for a small number of classes. But once we go beyond roughly 20 classes, scalability issues are strongly reflected. Let us assume that there are 50 classes in a dataset.
Realistically speaking, how many clusters would need to be formed so that clustering dissimilar classes helps the classification accuracy? And how far would these clusters need to be from each other, i.e., what intercluster distance is required so that the master classifier (the one that chooses the appropriate cluster for the inference image) remains accurate? Here are the two possibilities:
1. If the number of clusters is low, the classification accuracy inside a particular cluster will be no better than the accuracy on the entire dataset. Additionally, the master classifier will not be able to classify well, as there is no real distinctiveness between the clusters; the cluster-choosing accuracy would be too low.
2. If the number of clusters is high, the same problem of choosing the correct cluster appears, with an additional cost in training (as there might not be that much parallelism available).
Both of these options are in no way better than training on the master dataset directly. When we group similar classes, the intercluster distance is quite high, as each cluster retains the distinctive features discussed earlier. And when the number of classes to classify is lower, the classification accuracy tends to go higher as well, since better architectures specific to that cluster can be used. For these reasons, grouping similar classes together is the better choice.
Once we generate the sub-dataset clusters, we train independent models on each sub-dataset. At inference, we use the feature vectors to choose the best prediction model. This way, we reduce the computational needs while retaining (and in some cases increasing) the model accuracies, even with a limited number of epochs.
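The routing step above can be sketched as follows. This is a minimal illustration, not the paper's exact inference algorithm: it assumes each cluster is summarized by a centroid of its feature vectors and that `cluster_models` is a list of already-trained per-cluster predictors (both names are illustrative).

```python
import numpy as np

def choose_cluster(feature_vec, cluster_centroids):
    """Pick the cluster whose centroid is most similar (cosine) to the image feature."""
    sims = [
        np.dot(feature_vec, c) / (np.linalg.norm(feature_vec) * np.linalg.norm(c))
        for c in cluster_centroids
    ]
    return int(np.argmax(sims))

def clustered_predict(feature_vec, cluster_centroids, cluster_models):
    """Route the image to the model trained on the most similar sub-dataset cluster."""
    idx = choose_cluster(feature_vec, cluster_centroids)
    return cluster_models[idx](feature_vec)
```

Only the selected cluster's model runs at inference, which is where the computational saving comes from.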

Related works
Until recently, most of the work in image classification was done by making the models deeper and the labeled datasets larger. Recent work on dynamic model scaling did show the intent of tweaking a model's resolution, depth, and width, which can make it efficient without making it unnecessarily complicated [17]. Researchers have also put much work into model architecture efficiency, such as introducing inverted residuals and linear bottlenecks for lighter models [5] and descriptor pyramids [18]. Some work on recent architectures revolved around kernel functions [19]. We also saw work on deep sparse rectifier networks [20] and on understanding the difficulties of training deeper networks [21]. We even saw changes to base architectures to further improve performance [22], but mostly the assumption was that deeper architectures and faster computational machinery would deliver the accuracies we want.
Even so, the problem of overfitting and the cost of computation can become a huge hurdle. Even with state-of-the-art models, we are not able to achieve impressive accuracy when it comes to class-similar datasets.
What we try is to improve the methodology by incorporating knowledge driven by the image similarity of the classes in the dataset. Much work has been done on understanding the image using content-based analysis [23] or even complex wavelet structural analysis [24].
But that work was mostly channeled into image retrieval systems. We saw work on using this analysis to make applications better [25], applied not only to content-based image analysis but also to object-based image analysis [26].
Further, we saw the development of similarity engines [27] and best-in-class content-based image retrieval systems.
With the use of the fundamentals of content-based image analysis [14], we produce a similarity matrix for the entire dataset. The similarity between the classes provides critical insight into training better models: not improving the architecture, but improving how these architectures are trained. We hence developed a generic methodology of clustering the data into sub-datasets and then training models independently on them; once trained, the set of models is used jointly at inference.

Stanford dogs dataset.
The Stanford Dogs dataset [10] was split into a two-cluster and a three-cluster version. In the three-cluster split, classes were distributed in the ratio of 33:22:5 across the three splits. The entire list of splits has been provided in Appendix B.

Oxford flowers dataset.
Oxford Flowers dataset [28] contains 8189 images of 102 flower species.
Each class consists of between 40 and 258 images. I have used six versions of this dataset: the original, plus the sub-datasets of a two-cluster and a three-cluster split. As with the dogs dataset, the original dataset was split into two-cluster and three-cluster versions. In the two-cluster split, the first cluster contained 86 classes and the second had 16. In the three-cluster split, classes were distributed in the ratio of 16:69:17 across the three splits.

Feature Extraction
The first step in the generation of the DSI is understanding the machine's perspective, i.e., the features that the machine extracts. For this, I use a pretrained ResNet152 and make a forward pass on all the images. The feature vector is taken at the average pooling layer, before it is passed to the fully connected layer. Hence, the length of the feature vector, in the case of ResNet152, is 2048. The reason for extracting features this late in the network is that the later layers carry more specific features, while the earlier ones carry more general features [29]. Two other extraction points are available: one is to pass the vector through max pooling and extract it there; the other is to take the vector at the fully connected layer, which in our case would be 1000 units long. But according to the work by [30], the features work best with average pooling, especially for content-based image similarity.
Another point to note is model selection. Although any model can be used for feature extraction, the choice depends a lot on the depth of feature extraction required. I have noticed that in most cases, specific features are better extracted by deeper models. This argument is also supported by the results of a comparison of classification algorithms [31], as well as by comparative results on model choice in CBIR systems [30].

Similarity matrix generation
There are three steps to similarity matrix generation.
1. Feature centroid generation. For each class, a centroid of its feature vectors is computed. Importantly, the class should not be represented by outliers or extremes, as that would only hurt the distance computations for similarity.
Once the feature centroids are extracted, the similarity between classes can be computed by taking the cosine distance (as per equation 1) between the feature centroids of each pair of classes. The reason for using cosine distance is that it achieves the highest mAP score in CBIR systems when used with the feature vector of the average pooling layer. Additionally, according to a comparative study as well as a poll of 100 human volunteers, distances like cosine, Euclidean, Manhattan, and Vector Cosine Angle Distance (VCAD) show the same amount of contradiction between similarities calculated by machines and by humans [32].
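The two steps described above, centroid generation and pairwise cosine distance, can be sketched as follows. Function and variable names are illustrative; the centroid here is a plain per-class mean of the extracted feature vectors.

```python
import numpy as np

def class_centroids(features, labels):
    """Mean feature vector per class. features: (N, D); labels: (N,) class ids."""
    classes = np.unique(labels)
    return np.stack([features[labels == c].mean(axis=0) for c in classes])

def cosine_distance_matrix(centroids):
    """Pairwise cosine distance (1 - cosine similarity) between class centroids."""
    normed = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return 1.0 - normed @ normed.T
```

The resulting matrix has zeros on the diagonal, and small entries mark pairs of classes that the network's features consider similar, which is exactly what the clustering step consumes.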

Cluster formation technique
The cluster formation technique takes the similarity matrix and the number of clusters as input. The algorithm then fits a hierarchical clustering algorithm on the pairwise class distances to form the sub-dataset clusters.

Experiments and discussions
As we break the dataset down into individual clusters that can be trained independently, it becomes easier to evaluate any change in model architecture. Since similar classes pose the most difficult task, the accuracy achieved on a cluster dataset is, in turn, reflected in the model trained on the entire dataset. Because of this, it is possible to experiment with and evaluate different methodologies in a quarter of the time or less. It is also possible to use different models for different sub-dataset clusters, ensuring the best possible model for each particular cluster. This makes a significant difference not only in the speed of model training but also in the accuracy.
A model improved on a sub-dataset cluster performs well on the original dataset too. Hence, clustered training leads to two versions for inference: using the model trained on the original dataset after improving it on the sub-dataset cluster, or using the models trained on the clustered datasets directly, with the inference technique shown in Algorithm 3. In Table 1 we see test accuracies of the same model architectures on three different versions of the Oxford flowers dataset, i.e., the original dataset, the two-cluster dataset, and the three-cluster dataset. As observed, the latter two always outperform the original. When we train on sub-dataset clusters, the amount of computational resources required drops significantly. As the fundamental load of the data falls on runtime memory, the amount of memory required to cache the dataset is also reduced. This makes learning possible even on lower computational resources.
Another advantage of clustered training is the ability to have faster extensions. Suppose, for example, a model is trained on 120 classes and one more class now needs to be appended. With joint training, the model would need to retrain on 121 classes and hence would take a considerable amount of time. If the same model is instead trained on 4 sub-dataset clusters of 30 classes each, the new class can be associated with one of those clusters and only that cluster retrained. So the training would happen on only 31 classes, rather than 121. Hence, sub-dataset clustering makes small class extensions fast and efficient.

Future works and conclusion
The paper shows how knowledge of the similarity of image classes can be leveraged to make model training better and faster. It can also be used when the computational capacity of a single thread is low. The dataset can easily be broken down into clusters, which are trained individually, and a rule-based system can then make the appropriate model choice for predictions. The paper shows that the methodology is adaptive, but more work can still be done on improving dynamic feature extraction as well as on reducing the error of cluster choice. The clustering algorithms can be changed and chosen according to the way the data is represented. We leave room for additional parameters on which the methodology can be improved, and comparative analyses with different clustering algorithms, numbers of clusters, and base model architectures can be performed. There is still the possibility of improving the way the clusters are formed or the way the similarity is extracted. The broader aim is to drive deep learning architectures and modern datasets toward declaring a similarity matrix, giving them better knowledge and the capacity to choose the correct model for classification.

Funding Source Declaration
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.