A Discriminative Neighborhood-Based Collaborative Learning for Remote Sensing Scene Classiﬁcation

—The bag-of-words (BoW) model is one of the most popular representation methods for image classiﬁcation. How-ever, the lack of spatial information, change of illumination, and inter-class similarity among scene categories impair its performance in the remote-sensing domain. To alleviate these issues, this paper proposes to explore the spatial dependencies between different image regions and introduce a neighborhood- based collaborative learning (NBCL) for remote-sensing scene classiﬁcation. Particularly, our proposed method employs mul- tilevel features learning based on small, medium, and large neighborhood regions to enhance the discriminative power of image representation. To achieve this, image patches are selected through a ﬁxed-size sliding window where each image is repre- sented by four independent image region sequences. Apart from multilevel learning, we explicitly impose Gaussian pyramids to magnify the visual information of the scene images and optimize their position and scale parameters locally. Motivated by this, a local descriptor is exploited to extract multilevel and multiscale features that we represent in terms of codewords histogram by performing k-means clustering. Finally, a simple fusion strategy is proposed to balance the contribution of these features, and the fused features are incorporated into a Bidirectional Long Short-Term Memory (BiLSTM) network for constructing the ﬁnal representation for classiﬁcation. Experimental results on NWPU-RESISC45, AID, UC-Merced, and WHU-RS datasets demonstrate that the proposed approach not only surpasses the conventional bag-of-words approaches but also yields signiﬁ- cantly higher classiﬁcation performance than the existing state-of-the-art deep learning methods used nowadays.


I. INTRODUCTION
R EMOTE sensing has received unprecedented attention due to its role in mapping land cover, geographic image retrieval, natural hazards detection, and monitoring changes in land cover. The currently available remote sensing satellites and instruments (e.g., IKONOS, unmanned aerial vehicles (UAVs), synthetic aperture radar, etc.,) for observing earth not only provide high-resolution scene images but also give us an opportunity to study the spatial information with a fine-grained Figure 1. The challenging scene images of AID dataset [68]. (a) the intraclass diversity and (b) interclass similarity are the main obstacles that limit the scene classification performance. This encourages us to learn multilevel spatial features that have small within-class scatter but large between-class separation.
detail. However, within-class diversity among scene categories is one of the main challenges that brings new obstacles in image analysis as shown in Fig.1 (a). The first and second row represents resort and park scenes, respectively. A large diversity can be observed even within the same class. Here, the "scenes" belong to different types of subareas extracted from large satellite images. These subareas could be different types of land covers or objects and possess specific semantic meaning, such as commercial area, dense residential, sparse residential and parking lot in a typical urban area satellite image [68]. With the development of modern technologies, scene classification has been an active research field, and correctly labeling it to a predefined class is still a challenging task.
In the early days, most of the approaches focused on handcrafted features which can be computed based on shape, color, or textual characteristics where commonly used descriptors are local binary patterns (LBPs) [26], scale invariant feature transform [42], color Histogram [56], and histogram oriented gradients (HOG) [18]. A major shortcoming of the low-level descriptors is that not fulfilling the entire scene understanding due to high diversity and non-homogeneous spatial distributions of the scene classes. Moreover, they require complex engineering skills that rely on expert experience. In comparison to handcrafted features, the bag-of-words (BoW) model is one of the famous mid-level (global) representations and is extremely popular in image analysis and classification, while providing an efficient solution for aerial or satellite image scene classification. It was first proposed for text analysis and then extended to images by a spatial pyramid method (SPM) because the vanilla BoW model does not consider spatial and structural information. The SPM method divides the images into several parts and computes BoW histograms from each part based on the structure of local features. The histograms are then concatenated from all image parts to make the final representation [33]. Although these mid-level features are highly efficient, they may not be able to characterize detailed structures and distinct patterns. For instance, some scene classes are represented mainly by individual objects, e.g., runway and airport in remote-sensing datasets. As a result, the BoW model has limited performance in dealing with complex and challenging scene images.
Recently, deep learning based methods have been successfully utilized in scene classification, and proven to be promising in extracting high-level features. For instance, Wang et al. [66] proposed a domain adaptation method based on the deep neural network, where the manifold alignment is adopted on the target domain to avoid distortion. Chen et al. [12] proposed an associative learning-based domain adaptation that does not require target labeled information and can achieve unsupervised classification of the target image. A combined CNN-based recurrent neural network is proposed in [57] to exploit both local and long-range spatial relation information to enhance the classification performance of the model. How-ever, the backpropagation process, hyperparameter setting, or training a CNN from the scratch remains challenging, and a small number of training samples often cause overfitting in deep learning models. To address these limitations, most works tend to use a pre-trained model as a feature extractor, such as VGGNet, which can easily get deep features without high computational cost [44]. However, the deep learning based methods generally analyze an individual patch, and treat different scene categories equally. Thus, they fail to capture contextual dependencies for better representation. To illustrate this observation, some images from AID dataset are displayed as an example in Fig.1 (b). One can see that both stadium and playground scenes display high appearance similarity and the diversity within the scene varies largely. Moreover, natural images can be mainly captured by cameras with manual or auto-focus options and it makes them to be center-biased [28]. However, in the case of remote sensing scene classification, images are usually captured overhead. Therefore, using a CNN as a "black box" to classify remote sensing images may be not good enough for complex scenes. Even though several works [44,6] attempted to focus on the critical local image patches and discard the useless information, they still only utilize the visual information [37].
Despite the success of deep learning and BoW algorithms, the intraclass diversity and interclass similarity are two big challenges that still needed to be addressed. To alleviate these issues, we propose a multilevel learning approach to extract image features region by region based on small, medium, and large neighborhood patches to fully exploit spatial structure information in BoW model. This is motivated by the fact that even patch sizes are different in size, they exhibit good learning ability of spatial dependencies between image region features that may help to interpret the scene [60]. Our proposed method also magnifies the visual information by utilizing Gaussian pyramids and combine these two approaches to solve the problem of remote-sensing scene classification. In order to balance the contribution of these two types of features, we propose a simple fusion strategy based on three motivations. First, even though a considerable amount of literature on fusion of multiscale or multilayer features is available in scene classification, there is currently no clear consensus on the best features for large-scale scene recognition. Our second motivation is to introduce a simple fusion strategy that can surpass the previous performance without utilizing state-ofthe-arts fusion methods such as DCA [10], PCA [35], CCA [46], etc., as previously utilized in remote sensing domain. The third motivation is to evade the disadvantages of traditional dimensionality reduction techniques such as principle component analysis (PCA): its data-dependent characteristic, the computational burden of diagonalize the covariance matrix, and the lack of guarantee that distances in the original and projected spaces are well retained. In summary, we want to develop a simple bag-of-words method to make full use of spatial and visual information, which can effectively improve the classification performance. Therefore, in this paper, we propose a simple, yet very effective approach called neighborhood-based collaborative learning (NBCL) that builds on fusion of the corresponding multilevel and multiscale features. In particular, the NBCL encodes spatial and visual information based on small, medium, and large neighborhood regions. The extracted spatial locations which are used in our work are visualized in Fig 2. The final representation for an image is achieved by fusing small, medium, and large scale spatial and visual histograms. For classification purpose, the BiLSTM approach is adopted due to its proven efficient classification performance. In contrast with previous works which choose a certain sampling (sparse, dense, random, etc.) approach [29,28], the proposed work defines an artificial fixed-size sliding window not only for the original image but also extend in scale space pyramid to accommodate all the multiscale patches of the image and extract local features from each patch. In summary, our main contributions in this paper are summarized as follows: 1) We present a neighborhood-based collaborative learning (NBCL) to combine all the surrounding features into a new single vector and address the problem of intraclass diversity and interclass similarity. 2) To improve the visual information, a smoothing and downsampling is preformed by convolving the image with Gaussian kernels. Simultaneously, we propose to integrate the fixed neighborhood regions into multiple downscaled versions of the input image in a scale space pyramid. In this way, we can explore more content and important information.
3) The proposed method not only surpasses the previous BoW methods but also several state-of-the-art deep learning-based methods on four publicly available datasets and achieves state-of-the-art results. The rest of this work is organized as follows. Section II discusses the related literature work of this study. Section III introduces the proposed NBCL for remote sensing scene classification. Section IV shows the experimental results of the proposed NBCL on several public benchmark datasets. Section V summarizes the entire work and gives suggestions for future research.

II. RELATED WORK
Early attempts heavily depend on the hand-crafted features and focus on different types of color features for remote sensing scene image analysis. Since only spectral information can be utilized, the color features are more convenient to extract in comparison with texture and shape features. The color histograms and color moments provide discriminative features and can be computed based on the local descriptors such as local binary patterns (LBPs) [26], scale invariant feature transform (SIFT) [42], color histogram [56], and histogram oriented gradients (HOG) [18]. Yu et al. [75] proposed a new descriptor called color-texture-structure (CTS) to encode color, texture and structure features. In their work, dense approach is used to build the hierarchical representation of the images. Finally, the co-occurrence patterns of regions are extracted and the local descriptors are encoded to test the discriminative capability. Chen et al. [11] evaluated the performance of 13 features consists of color, structure, and texture features. To perform classification, k-nearest-neighbor (KNN) classifier and the support vector machine classifiers (SVM) are employed and the decision level fusion is performed to improve the performance of scene images. Tokarczyk et al. [59] proposed to use integral images and extract discriminative textures at different scale levels of scene images. The features are named as Randomized Quasi-Exhaustive (RQE) which are capable of covering a large range of texture frequencies.
In bag-of-words (BoW) framework, conventional coding methods focus on a single grid size or block based response to extract spatial features while not taking into account the other useful filter responses. An image often consists of a single object, but sometimes several objects or particles also appear in the surrounding. Although researchers showed that by incorporating relative spatial information, such as distance or angular directions, the technique becomes less effective as the codebook size increases. To address these challenges, Savarese et al. [51] introduced integral correlograms to capture spatial co-occurrences of features, which are easy to compute with respect to basic geometric transformations. The combination of correlograms and visual words is explored by forming a co-occurrence matrix of visual words as a function of distance. The co-occurrence matrix models the typical spatial correlations between visual words in object classes, yielding invariance to rotation and translation. In order to improve the classification performance, the co-occurrence matrix with the orderless BoW is also utlized in [74]. However, the correlogram matrix takes expensive computation power and memory cost [51].
Khan et al. [31] investigated multiple hand-crafted color features in bag-of-word model. In their work, color and shape cues are used to enhance the performance of the model. Yang et al. [73] improved the BoW model based on the spatial cooccurrence kernel, where two spatial extensions are proposed to emphasize the importance of spatial structure in geographic data. Vigo et al. [63] proved that incorporating color and shape in both feature detection and extraction significantly improves the bag-of-words based image representation. Sande et al. [61] proposed a detailed study about the invariance properties of color descriptors. They concluded that the addition of color descriptors over SIFT increases the classification accuracy by 8 percent. Lazebnik et al. [33] proposed a spatially hierarchical pooling stage to form the spatial pyramid method (SPM). To improve the SPM pooling stage, sparse codes (SC) of SIFT features is merged in the traditional SPM [72]. To further enhance the spatial information, a new coding algorithm called Locality-constrained Linear Coding (LLC) is introduced that utilizes the locality constraints to project each descriptor into its local-coordinate system. Then, a max-pooling is used to integrate projected coordinates [64]. Computational efficiency is achieved by introducing a soft-assignment coding that computes the distance from a local feature to each word, and allows each feature vector to belong to multiple histogram bins [38]. However, the classification performance is limited to the newly generated sparse or local coding schemes. Boureau et al. [8] analyzed the various combinations of coding and pooling schemes through a comprehensive cross evaluation of several types of pooling and coding stages: hard assignment vs. soft assignment, the linear kernel vs. the histogram intersection  kernel and average pooling vs. max-pooling. Zhou et al. [79] introduced a hierarchical Gaussian mixture model for feature vectors at difference levels, and several Gaussian maps for its spatial layout. To alleviate noise directions and to further enhance the discrimination power, a supervised dimension reduction technique is introduced called Discriminant Attribute Projection (DAP). In contrast to kmeans clustering, a modified coding approach is introduced in [32] called Fisher vector coding. They extend BoW by using a Gaussian mixture model to encode spatial layout. To encode the appearance of local features, representation of spatial layout is combined with the use of Fisher kernels. While their representation is computationally efficient and compact, their evaluations indicates marginal improvement over SPM. Gabriel et al. [47] presented a new method, called Sparse Spatial Coding (SSC). In their work, they build the dictionary with a set of random patches and code a descriptor using a spatial constraint. Their method is closely related to LLC [64].
In recent years, a large number of studies have been conducted to merge attention mechanism and multiscale learning based on deep learning models. Ghanbari et al. [21] proposed a dense-global-residual network to reduce the loss of spatial information and enhance the context information. The authors used a residual network to extract the features and global spatial pyramid pooling module to obtain more abundant multiscale features at different levels. Zhao et al. [78] proposed a deep learning model which simultaneously captures spectralspatial features of the target pixel and its neighboring pixels for classification. The authors used a center attention module that pays more attention to the features correlated to target pixels and reduce the number of parameters in the network via weighted sum of the spectral-spatial features. Zuo et al. [81] proposed a convolutional recurrent neural network to learn the spatial dependencies between image regions and enhance the discriminative power of image representation. The authors trained their model in an end-to-end manner where CNN layers are processed to generate mid level features and RNN layer is learned to encode spatial dependencies.Xue et al. [70] proposed a hierarchical residual network to extract multiscale and spectral features at a granular level. The authors used an attention mechanism to set adaptive weights for spatial and spectral features of different scales for the further improvement of the discriminative ability of extracted features. Ran et al. [50] proposed a multiscale context and enhanced channel attention model that employs PeleeNet as the backbone network. The authors improved the characterization ability of the convolutional neural network by proposing channel attention approach. Huang et al. [30] proposed an end-to-end deep learning model and employ multiscale feature fusion, a channel-spatial attention, and a label correlation extraction module. Specifically, a channel-spatial attention mechanism is used to fuse and refine multiscale features from different layers of the CNN model. Moreover, a label co-occurrence matrix is utilized to extract the label correlation information and embedded into the multiscale attentive features which increases the discriminative ability of their proposed model.
Mei et al. [43] proposed a sparse representation-based model with deep feature fusion. Multilevel features are extracted from different layers of convolutional neural networks to exploit the feature learning ability. Zhang et al. [77] proposed a multi-scale deep feature representation and the regionbased features selection. The model first filters the multi-scale deep features extracted from pre-trained convolutional networks and then fuses those features via their proposed fusion strategy. The authors utilized the complementarity between local and global features by exploiting the features of different scales and discarding the redundant information in features. Liang et al. [37] introduced a novel two-stream architecture combining global-based visual features and object-based features. The model first extracts the appearance visual features from the scene image using convolutional neural network and later detects the ground objects and finally constructs a graph to learn spatial features using a graph convolutional network. Tian et al. [58] proposed a multiscale dense network based on squeeze and excitation attention, dense connections and squeeze-and-excitation attention mechanism. The authors imposed two settings with computational constraints including budgeted batch classification (a fixed computational budget setting for sample classification) and prediction module that forces the network to predict the output.
Fu et al. [19] proposed a feature fusion architecture to generate a multiscale features hierarchy that augments the features of shallow layers with semantic representations and combine the feature maps of top layers with low-level information. The authors built a unified framework upon the regionbased convolutional neural network for arbitrary-oriented and multi-scale object detection. Cheng et al. [17] proposed a multi model fusion neural network with the combination of a convolutional neural network and a multilayer perceptron to estimate a fine-resolution population mapping. This model takes the local spatial information and global information from multisource data to estimate the fine-resolution population where a first-order space matrix of a geographic unit is used to characterize these information. Qu et al. [49] proposed a novel multiscale deeply supervised convolutional feature fusion module. The authors used multiscale feature fusion by using the high-level features and the low-level features with deep supervision that provides direct supervision to improve the performance of the model. Li et al. [36] proposed an adaptive multilayer feature fusion model to fuse different convolutional features with feature selection operation, rather than simple concatenation. The authors claimed that their proposed method is flexible and can be embedded into other neural architectures.
In overall, patch sampling or feature learning is a key step for building up an intelligent system either for CNN model or BoW-based approaches. In deep learning CNN models, convolutional layers convolve the local image regions independently, and pass its result to the next layer, whereas pooling layers summarize the dimensions of data. Due to wide range of image resolution and various scales of detail textures, fixed sized kernels are inadequate to extract scene features of different scales. Therefore, the focus has been shifted to multiscale and fusion methods in scene image classification domain, and existing deep learning methods are making full use of multiscale information and fusion for better representation.
A sampling step is generally required to select uniform, sparse, random, and dense representative subset of the images for subsequent analysis. The most popular BoW methods utilize a fixed grid size and extract features at different scales [39,24,3]. This is because it is still not clear which sampling strategy is suitable, and even no optimal patch size can be set for the scene classification of high-resolution remote sensing images. Due to the unique nature of remote sensing images, discriminative patch sizes are needed that can focus on the significant regions within the image itself in the procedure of feature learning. Thus, it stimulates us to research whether the correlation of scene categories can increase the discriminative ability of BoW features. NBCL provides a natural approach to

III. PROPOSED METHOD
The proposed approach is divided into three indispensable components: (1) estimation of multi-neighborhood regions (2) information fusion and (3) a BiLSTM based sub-network for classification purpose. We first describe the procedure of neighborhood estimation with multi-scale filtering. Next, we describe the proposed fusion along the classification process of BiLSTM network. The overall procedure of the proposed approach is illustrated in Fig 3.

A. Features extraction using multi-neighborhood regions
In order to explore the spatial relationship between scenes or sub-scenes, we propose to extract multilevel features with the objective that different regions contain discriminative characteristics that can be used to extract more meaningful information and to correctly classify target samples. Based on our observation, the size of the neighborhood has a great impact on the scene representations and classification performance. To achieve this, we first define a region over the entire image, where the patch sizes used are (4×4), (6×6), (8×8), (10×10), and the sliding step is set to be 1 pixel. An example image with the neighborhood patch (4 × 4) size is provided in Fig.4 to show how the local descriptor is exploited in each part of the region. Here, the definition of different neighborhood size is considered to be small, medium, or large regions. Thus, four kinds of sizes are used for each image to ensure that the output is full of content information. To achieve multiscale information of each region, we propose to use multi-scale filtering motivated by the fact that it can adaptively integrate the edges of small and large structures by removing image noise. Inspired by the Gaussian scale-space [67], images are repeatedly smoothed with appropriately sized Gaussian kernels by convolving over image: where * is the convolution operator and G(a, b, σ i ) is the Gaussian kernel with the standard deviation * defined by . In this way, noise and illumination factors are suppressed by using these smoothed images and coarser structures are emphasized since no new structures are created after smoothing. Therefore, the proposed idea takes the advantages of both schemes. We experimentally study the outcome of this choice in ablation study section. Once the scale space has been built, we utilize SURF descriptor [4] to extract the features within a bounded search area. For an image I, image scales m i = (i = 1, 2, ..., n) are denoted as x mi = (i = 1, 2, ..., n). Formally, for each smoothed image the feature extracted from the SURF is illustrated as below: where n is the number of scales, i is the index of scale, x mi is the i th scale, x mi is the region at i th scale, and f mi is the SURF feature for x mi . In order to construct the visual vocabulary, SURF features are clustered through the k-means clustering process and mapped to a specific codeword, thus, can be represented by a histogram of visual words. The histogram becomes a final representation of the image.

B. Information fusion
Information fusion is the process of combining multiple pieces of information to provide more consistent, accurate, and useful information than a single piece of information. In general, it is divided into four categories: decision level, scale level, feature level, and pixel level [55]. Among them, the feature level fusion has comparatively a shorter history but is an emerging topic in a remote-sensing domain. The spatial relation between the proposed regions can improve scene classification in two aspects. First, aggregating the information of a neighborhood and its adjacent neighborhoods assists in recognizing the features that accurately represent the scene type of the image. For instance, determining whether farmland belongs to a forest field or a meadow requires information about its neighboring area. Second, the natural relationship of the spatial distribution pattern of a scene helps to infer the scene category. Industrial area, for instance, is likely planar, and the runway is always linear. Therefore, we select to combine four different regions based on multiscale features, in the aim to obtain more informative and relevant features to represent the input image. Each input image I produced four sets of multiscale features, that is Q 1 , Q 2 , Q 3 , and Q 4 representing four sets of features. The final fused vector illustration is obtained as: where i is the imaginary unit.

C. Recurrent Neural Network (RNN)
The multiscale histograms from different regions provide crucial information for understanding the spatial structure. We concatenate them into a final representation, which is then used as an input to Bidirectional neural network [52]. The Bidirectional Long Short-Term Memory Networks learns the correlations of features and encodes the feature histograms based on the memory cell (M t ), and are competent for retaining track of the dependencies between the elements in the input sequence. It consists of an input gate (i t ), an output gate (o t ) and a forget gate (f t ). The input gate controls the information flow into the cell by multiplying the cell's nonlinear transformation of inputs n t . The output gate governs how much information from the cell is employed to calculate the output activation of the LSTM unit. The forget gate decides the amount to which a value remains in the cell. The LSTM unit updates for time step t are: where x t denotes the input at the current time-step, i t denotes the current cell state. f , i and n represent the input gate activation, forget gate activation and output gate activation respectively, σ illustrates the logistic sigmoid function and represents element-wise multiplication.

IV. DATASETS AND EXPERIMENTAL SETUP
In this section, we first provide a brief description of four databases that are used to evaluate our method. Then, the implementation details and ablation analysis are discussed and the results are compared with state-of-the-art methods.

UC Merced Land Use Dataset (UC-Merced):
This dataset was obtained from the USGS National Map Urban Area with a pixel resolution of one-foot [73]. It contains 21 distinctive scene categories and each class consists of 100 images of size 256 × 256 × 3. Inter-class similarity, for example, highway and architecture scenes can be easily mixed with other scenes, such as freeways and buildings, which makes this dataset a challenging one.
WHU-RS Dataset: It was collected from satellite images of Google Earth [54]. This dataset consists of 950 scene images and 19 classes with a size of 600 × 600. Each image varies greatly in high resolution, scale, and orientation, which makes it more complicated than the UCM dataset.
Aerial Image Dataset (AID): There are 10000 images in         AID dataset, which are categorized into 30 scene classes [68]. Each class contains images ranging from 220 up to 420 with the fixed size of 600 × 600 pixels in the RGB space. The pixel resolution changes from about 8 m to about half a meter. NWPU-RESISC45 Dataset: It consists of 31,500 remote sensing images divided into 45 scene classes, covering more than 100 countries and regions all over the world [13]. Each class contains 700 images with the size of 256 × 256 pixels. This dataset is acquired from Google Earth (Google Inc.), where the spatial resolution varies from 30 to 0.2 m per pixel. This is one of the largest datasets of remote sensing images and is 15 times larger than the most widely-used UC Merced dataset. Hence, the rich image variations, high interclass similarity, and the large scale make the dataset even more challenging.

B. Implementation details
To evaluate the performance on the above-mentioned datasets, the BoW is used as the base architecture with four distinct neighborhood sizes and seven adjacent Gaussian scaled images, i.e., [1.6, 2.5, 3.5, 4.5, 5.5, 6.0, 6.4]. The vocabulary size of k in the remote-sensing domain varies from a few hundred to thousands. We set the size of visual vocabulary to 15000 for UC Merced, AID, NWPU, and 10000 for the WHU-RS dataset. The BiLSTM is trained using the Adam optimizer with gradient threshold 1, while the minibatch size of 32 with hidden layer dimension of 80. Initializing the BiLSTM with the right weights is a challenging task because standard gradient descent from random initialization can hamper the training of BiLSTM. Therefore, we set the recurrent weights with Glorot initializer (Xavier uniform) [22] which performs the best in all scenarios of our experiments. To decrease the computation complexity on AID and NWPU datasets, we only use four Gaussian scaled images where the highest filter image takes a weight of 4.5, and the lowest 1.6.

C. Ablation study
We thoroughly validate the performance of each neighborhood size by performing an ablation study. In Table I, we have reported results of estimating the proposed NBCL on UC Merced and WHU-RS datasets. Note that, these experiments are performed by using four different neighborhood sizes with six Gaussian scaled [1.6, 2.5, 3.5, 4.5, 5.5, 6.0] images. Our one-stage detection method on WHU-RS dataset with the size of (4 × 4) achieves 86.10% accuracy and the numerical results of each category are shown in Fig.5 (a). It can be  [4], SPM-BOW [33], and ours. (b) Comparing the performance on UC Merced dataset using SURF-BOW [4], SPM-BOW [33], and ours. (c) Comparing the performance on NWPU dataset using SURF-BOW [4], SPM-BOW [33], and ours. seen that several classes such as bridge 63%, viaduct 71%, railway station 71%, meadow 78%, and desert 76% are highly misclassified. In Fig.5 (b), we show that when the neighborhood size (6 × 6) increases, the classification performance is improved up to 88%, 80%, 77%, and 90% for bridge, viaduct, railway station, meadow, and desert, respectively. The overall classification is improved up to 89%, which is 3% higher than the (4 × 4) size. We further increase the size up to (8 × 8) and notice that viaduct, railway station, meadow are 100% correctly classified and achieve 92% classification accuracy. Results demonstrate that different neighborhood sizes play different roles in classifying remote sensing scene images, and it is hard to determine an optimal size to obtain the best result. On the other hand, 4 × 4 block size outperforms other block sizes in UC Merced dataset while 8 × 8 block size is more effective for the WHU-RS dataset as shown in Table I. 1) Scale Factor of Gaussian Kernel: Fig.6 shows the classification performance of each scaled image based on 10 × 10 neighborhood size. The NBCL extracts multiscale dense features according to the scale factor σ to control the Gaussian kernel. It can be observed that with the increase of scale factor, the performance first improves and than gradually decreases after the 6.0 scaled image. We conclude that including a certain range of Gaussian smoothed images can improve the performance, but too much of them degrade the performance.
2) Codebook learning: We quantitatively analyze the performance with the SURF descriptor and SPM method in the bag-of-words framework. An engaging question is how much the performance can be improved by defining the proposed spatial locations with multiscale information. With this in mind, we set different vocabulary sizes for WHU, UC Merced, and NWPU datasets. The respective outcomes can be found in Fig.7 (a) (b) and (c). One can see that even the proposed onestage detection method with the neighborhood size of (4 × 4) significantly outperforms the SPM method. Similarly, using the SURF descriptor in the BoW framework cannot achieve the best performance and provides more than 20% lower accuracy with ours on all databases.
3) Visualization of Feature Structures: One of the advantages of the proposed approach is that we can interpret the classification process of the model. Especially for each stage, we can see how the features are structured into data space and their impact along the different classification stages. Taking this into consideration, we employed the "t-distributed stochastic neighboring embedding" (t-SNE) algorithm [62] and illustrated the derived embeddings into three separated processing stages: 1) one-stage learning, 2) combined learning (NBCL), and 3) BiLSTM classified features for the WHU dataset. The features with the patch size of 4 × 4 in Fig.8  (a) show that most classes are strongly correlated, which makes the classifier (BiLSTM) hard to separate them. We also visualize the clusters by fusing all the neighborhood features in Fig.8 (b). The derived clusters indicate that the proposed fusion reduces the correlation between similar classes and can capture more variability in the feature space. Moreover, it could be noticed from Fig.8 (c) that all the classes are well separable which could potentially lead to a better performance when training BiLSTM on remote sensing dataset.
D. Performance comparison with state-of-the-art methods 1) NWPU-RESISC45 Dataset: To demonstrate the superiority of the proposed method, we evaluate the performance against several state-of-the-art classification methods on the NWPU dataset is shown in Table II. Especially, we choose mainstream deep learning and BoW based methods and compare the performance of scene classification. It could be observed from Table II, the proposed approach, by combining all neighborhood-based features, achieved the highest overall performance of 94.20% and 97.13% using 10% and 20% training ratios, respectively. It is worth mentioning that NWPU is much more difficult than the other three datasets and our proposed method outperforms the previous state-of-the-art method by a margin of 4% under the training ratio of 20%. The classification performance of the proposed NBCL shows the effectiveness of combining global-based visual features on the NWPU dataset. Fig.10 illustrates the confusion matrix produced by our proposed method (NBCL) with the 20% training ratio. Each Method Accuracy (Mean±std) AlexNet+sum pooling [2] 94.10±0.93 VGG-VD16+sum pooling [2] 91.67±1.40 SPP-Net [25] 96.67±0.94 GoogleNet [68] 94.31±0.89 VGG-VD16 [68] 95.21±1.20 DCA fusion [10] 96.90±0.77 MCNN [41] 96.66±0.90 D-CNN [16] 98.93±0.10 Triple networks [40] 97.99±0.53 VGG-VD16 +AlexNet [35] 98.81±0.38 Fusion by concatenation [45] 98.10±0.20 MDFR [77] 98.02±0.51 APDC-Net [5] 97.05±0.43 BoWK [46] 97.52±0.80 Attention GANs [76] 97.69±0.69 AlexNet+SAFF [9] 96.13±0.97 VGG-VD16+SAFF [9] 97.02±0.78 Color fusion [1] 98.10±0.00 Graph CNN [20] 99.00±0.43 IDCCP [65] 99.05±0.20 SEMSDNet [58] 99.41±0.14 NBCL (The proposed) 99.57±0.36 row represents the percentages of correctly and incorrectly classified observations for each true class. Similarly, each column displays the percentages of correctly and incorrectly classified observations for each predicted class. One can see that the classification performance of 41 categories is greater than 95% where only the 14 categories have achieved more than 95% in the previous methods [37]. However, one common challenge is found that the church and palace are two confusing categories which limits many existing works to surpass the performance [37]. In our case, 25% of images from church are mistakenly classified as a palace which is 1% high misclassification than the CNN + GCN [37]. On the other side, only 0.3% of images from the palace are mistakenly classified as an industrial area where the previous methods achieve 67% [69] and 70% [37] performance for the palace class. By analyzing the confusion matrix on NBCL, the airport, church, and commercial area are the only challenging classes for our proposed method. Thus, the experimental results demonstrate the proposed method improves the discriminative ability of features and works well on the large-scale NWPU-RESISC45 dataset.
2) AID Dataset: We evaluate and report the comparison results against the existing state-of-the-art classification methods for the AID dataset in Table III. It could be observed that NBCL achieved the overall accuracy of 96.11% and 98.43% using 20% and 50% training ratios, respectively. As can be seen from Table III, our method outperformed the SEMSDNet [58] with increases in the overall performance of 1.88% and 0.79% under both training ratios. Thus, our proposed method, by combining all the neighborhood features, verifies the effectiveness of multilevel and multiscale feature fusion. Fig.11 represents the confusion matrix generated by NBCL with the 50% training ratio. As can be seen from Fig. 11 Figure 9. The influence of the training sample ratios with different methods such as feature combination by MTJSLRC [80], Pretrained Alexnet [25], and kernel codebook coding [53].
classification performance of all the categories is higher than 95% and only the square category provides the lowest accuracy up to 97%. Specifically, 4 of images from the square are mistakenly classified as stadium, 3 of images from commercial are misclassified as dense residential. The five categories such as school, square, park, center, and resort are very confusing categories, which leads many existing works to be unable to get a competitive performance [58]. For instance, SFCNN [69] and the CNN + GCN [37] attain 70% to 91% accuracy for the class of resort while our method achieves 100% accuracy. It confirms that despite the high interclass similarity, the proposed method is capable of extracting robust spatial location information to distinguish these remote sensing scene categories.

3) UC Merced Dataset:
The evaluation results on the UC Merced dataset are presented in Table IV by using 80% training ratio. The proposed method achieves 99.57% accuracy and competes the previous BoW [46] approach by a margin of 2.05%. Moreover, the effect of the number of training samples on the UC Merced dataset is also examined by selecting 20%, 30%, 40%, and 80% as training samples and visualized in Fig. 9. It can be noticed that in comparison with other fusion methods, the proposed fusion method is superior from the start even with a 10% training sample ratio. For further evaluation, a confusion matrix of the UC Merced dataset is shown in Fig.12. A total of 3 images are misclassified in this dataset where buildings and mobile home parks are found to be challenging categories for our proposed method. Thus, the proposed method is effective to classify most of the scene categories.

V. CONCLUSION
In this paper, we explore the spatial dependencies between different image regions and propose a discriminative neighborhood-based collaborative learning to address the problem of interclass similarity in remote sensing scene images. Multi-neighborhood local patches are firstly proposed to preserve the spatial information between scenes or subscenes. Afterward, the preserved spatial locations are used in downscaled versions of the input image to get dense coverage of the entire scene in a scale-space pyramid. We show that multi-neighborhood learning in BoW model significantly improves the recognition performance compared to using the single level BoW alone. By combining BoW and RNN, we can learn discriminative feature representations. Experiments are conducted on four publicly available datasets, and the results demonstrate the different distribution of spatial location and visual information is crucial for scene classification. The proposed approach is expected to have advantages over single scale BoW or traditional CNNs methods, especially in the situation where a large number of training data is not available. We concluded the spatial information plays an important role and encourages multi-neighborhood strategy in scene classification. Our future work is to design an end-toend method that can automatically obtain multilevel and multiscale features without human intervention.

VI. DECLARATION OF COMPETING INTEREST
The authors have no conflict of interest that could have appeared to influence the work reported in this paper.

VII. ACKNOWLEDGMENTS
This work is supported by the University of Chinese Academy of Sciences, Beijing, China and the Center for Machine Vision and Signal Analysis (CMVS) in the Faculty of Information Technology and Electrical Engineering (ITEE) at University of Oulu, Finland. The financial support of the world academy of sciences (TWAS) is gratefully acknowledged. APPENDIX Figure 10, Figure 11, and Figure 12 represent the confusion matrices of NWPU-RESISC45, AID, and UC Merced datasets, respectively. The source code for reproducing the results of this research would be available to the research community upon the publication of this work.