3M-CDNet: A Universal and Powerful Benchmark for Remote Sensing Image Change Detection

—While deep learning-based methods have gained popularity and made remarkable progress in remote sensing (RS) image change detection (CD), the limited amount of available data hinders the performance of most supervised methods. CD networks transferred or derived from other fields can be confronted with a weak generalization capability. Developing a universal benchmark for performance evaluation based on the available datasets is therefore urgent. To address these problems, we propose a lightweight network, termed 3M-CDNet, which requires only about 3.12 M parameters. The lighter the network, the easier it is to train without overfitting the limited amount of data, resulting in a better generalization capability. 3M-CDNet has a flexible modular design that achieves performance improvements by incorporating plug-and-play modules. 3M-CDNet gains accuracy improvements in two ways: (1) the application of deformable convolutions (DConv) in the backbone network to gain a good geometric transformation modeling capacity for CD, and (2) the application of an effective two-level feature fusion strategy to enhance the feature representation capacity. 3M-CDNet gains a good generalization capacity by incorporating effective "tricks" to alleviate overfitting: online data augmentation (Online DA) is applied to increase the diversity of the training samples, and Dropout regularization is applied in the classifier. Extensive ablation studies have proved the effectiveness of the core components. Experimental results suggest that 3M-CDNet outperforms state-of-the-art methods on several optical RS datasets and serves as a new universal benchmark. Specifically, 3M-CDNet achieves the best F1-scores, i.e., LEVIR-CD (0.9161), Season-Varying (0.9473), and DSIFN (0.7031).


I. INTRODUCTION
CHANGE detection (CD) aims to identify and locate change footprints from multitemporal remote sensing (RS) images acquired over the same geographical region at different times. CD has a wide range of applications, such as land cover and land use identification, urbanization monitoring, and damage assessment [1]. With the rapid development of RS techniques, optical RS images are among the most representative data widely utilized for CD. In particular, high- and very high-resolution RS images, which reflect abundant spectral and spatial information of geospatial objects, allow us to retain more details and obtain high-quality change maps. In this study, the main concern is binary CD based on optical RS images. CD methods usually generate a pixel-level change map, in which pixels are classified as changed or unchanged.
With the impressive breakthroughs made in deep learning, CD methods have gradually evolved from traditional [2] to deep neural network (DNN)-based methods [1]. Supervised CD methods built on convolutional neural networks (CNNs) have gained popularity and show promising performance. Some attempts adopted the U-shaped architecture [3]-[5], which concatenates feature maps from different levels through multiple skip connections to improve accuracy. These studies demonstrated that both high-level semantic information and low-level detail information are important in CD. Unfortunately, it remains unclear which multilevel feature fusion strategy is better, and dense skip connections bring about heavy computational costs. Alternatively, recent works proposed attention-mechanism-based networks to learn discriminative features and alleviate the distraction caused by pseudo-changes [6]-[10]. However, they usually require a large amount of data to achieve satisfactory performance; otherwise, they easily overfit.
(The authors are with the School of Instrumentation and Optoelectronic Engineering, Key Laboratory of Precision Opto-mechatronics Technology.)
Despite the increasing number of raw RS images, the time-consuming and labor-intensive work of manual interpretation still hinders the development of CD methods, considering the data-hungry nature of DNNs. Currently, only a few openly available labeled datasets can be used for model training and evaluation, such as LEVIR-CD [6], Season-Varying [11], DSIFN [8], and HRSCD [4]. Unfortunately, all of them contain far less data than the ImageNet and COCO datasets. While DNN-based CD methods have made remarkable progress, the limited data hinder the performance of most supervised methods. CD networks derived from other computer vision (CV) tasks can be confronted with a weak generalization capability. Thus, overfitting becomes one of the main concerns. To address this problem, a lightweight CD model with few parameters becomes an intuitive solution.
Moreover, because different works adopt different datasets for evaluation, it is difficult to say which method achieves the best performance. Even when methods are evaluated on the same dataset, they often adopt different criteria to split it [6], [7], which makes direct comparison difficult. In addition, only a few works are openly available to the public. The lack of implementation details results in poor reproducibility, which heavily hinders the development of CD algorithms from research to application. Therefore, developing a universal benchmark based on the available datasets is urgent.
The main concerns of this study are as follows: (1) achieve a good generalization capability and (2) gain considerable accuracy improvements. Inspired by Occam's Razor ("Non sunt multiplicanda entia sine necessitate", or "Entities are not to be multiplied without necessity"), we propose a lightweight network called 3M-CDNet. To provide a universal benchmark with good reproducibility, all the implementation details will be made available. The main contributions of this study are summarized as follows:
1) This paper presents a lightweight yet powerful 3M-CDNet as a universal benchmark for CD that involves only about 3.12 M parameters. 3M-CDNet has a modular design with high flexibility that allows plug-and-play modules, e.g., DConv [12], to be easily incorporated for performance improvements.
2) To the best of our knowledge, this study is among the first to incorporate DConv into the backbone network to enhance the geometric transformation modeling capacity for CD.
3) This study explores a "bag of tricks" to achieve considerable performance improvements. Specifically, Online DA and Dropout regularization [13] were used to improve the generalization capability. A two-level feature fusion strategy was applied to improve the feature representation capacity.

II. PROPOSED METHOD
We propose a universal benchmark for CD, termed 3M-CDNet, which only involves about 3.12 M parameters. As shown in Fig. 1, 3M-CDNet mainly consists of two core components: (a) a DConv-based backbone and (b) a pixelwise classifier. The former is used for feature extraction from the input, e.g., a pair of bitemporal RGB images concatenated into a 6-band input. The latter classifies the extracted features into two classes and then generates a binary change map, in which pixels are either changed or unchanged. 3M-CDNet has a modular structure with high flexibility, which allows performance improvements by incorporating plug-and-play modules, such as DConv.

A. Network Architecture

1) DConv-based Backbone Network. As shown in Fig. 1(a), the backbone network of 3M-CDNet is composed of a stem block followed by two residual stages. The main concern is to reduce the size of the input through consecutive downsampling and convolution operations and to extract feature maps with varying degrees of semantics. Specifically, the stem consists of three stacked convolutional layers followed by a MaxPool layer.
The stem downsamples the input by a factor of 4, transforming it into a 3-D tensor whose spatial resolution is 1/4 of the input size. The low-level and high-level feature maps are then extracted from the two residual stages, respectively.
(a) Introducing the residual network [14]. When a CNN goes deeper, it may suffer from a degradation problem, which could hamper convergence. Therefore, the two stages adopt bottleneck residual blocks [14] as basic units, since residual blocks have the advantage of alleviating the degradation problem and promoting convergence during training. As shown in Fig. 1(c), bottleneck blocks can be formulated as follows:

x_{l+1} = ReLU( F(x_l) + h(x_l) ),   (1)

where x_l and x_{l+1} are the input and output tensors of the l-th residual block, respectively. F(·) indicates the residual function, i.e., the right branch that consists of three stacked convolution layers. h(·) indicates the identity mapping function, i.e., the left branch. h(·) applies a downsampling projection shortcut through a Conv1×1_BN layer only if the stride is set to 2, e.g., in the first block of a stage; otherwise, it applies an identity shortcut. ReLU is the rectified linear unit activation function for enhancing the non-linear fitting ability, and batch normalization (BN) is applied at the tail end of each convolution layer to make the training procedure more stable.
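As a minimal numpy sketch of Eq. (1), assuming an identity shortcut and a callable standing in for the three stacked convolution layers (this is an illustration, not the authors' implementation):

```python
import numpy as np

def relu(x):
    # Rectified linear unit activation.
    return np.maximum(x, 0.0)

def bottleneck_block(x, residual_fn):
    """Residual bottleneck as in Eq. (1): x_{l+1} = ReLU(F(x_l) + h(x_l)),
    with h the identity here. `residual_fn` is a stand-in for the three
    stacked conv layers of the right branch."""
    return relu(residual_fn(x) + x)
```

When the block downsamples (stride 2), h would instead be a strided 1×1 projection so that the two branches have matching shapes; that case is omitted here for brevity.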
(b) Introducing deformable convolutions [12]. Let x(p) and y(p) denote the features at location p of the input and output feature maps, respectively. Given a convolution kernel with K sampling locations, let w_k and p_k denote the weight and the default offset for the k-th location of the kernel, respectively, e.g., a 3×3 kernel with K = 9. The deformable convolution [12] can be formulated as follows:

y(p) = Σ_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k,   (2)

where Δp_k and Δm_k are the learnable 2-D offset and modulation factor for the k-th location, respectively, and k enumerates the sampling locations of the kernel. More precisely, two convolution layers of the same kernel size are separately applied over the input feature maps to obtain the offsets and the modulation factors. Because the sampling coordinate p + p_k + Δp_k is fractional, the value of x(p + p_k + Δp_k) is calculated from the values of the four surrounding integer points by bilinear interpolation. DConv thus consists of two steps: (1) generate deformable feature maps from the input feature maps based on the learned offsets in the x and y directions, and apply the learned modulation factors to modulate the activation of each location; and (2) apply a regular convolution over the deformable feature maps to generate the output feature maps (see Fig. 1(d)).
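To make Eq. (2) concrete, the following is a minimal single-channel numpy sketch (not the authors' implementation) that evaluates a modulated deformable 3×3 convolution at one output location, with bilinear sampling at the fractional coordinates:

```python
import numpy as np

def bilinear_sample(x, qy, qx):
    """Sample a (H, W) map at fractional coords (qy, qx) by bilinear
    interpolation over the four surrounding integer points (zero padding)."""
    H, W = x.shape
    y0, x0 = int(np.floor(qy)), int(np.floor(qx))
    val = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < H and 0 <= xx < W:
                val += (1 - abs(qy - yy)) * (1 - abs(qx - xx)) * x[yy, xx]
    return val

def deform_conv_at(x, w, p, offsets, mods):
    """Evaluate Eq. (2) at one output location p = (row, col):
    y(p) = sum_k w_k * x(p + p_k + dp_k) * dm_k, for a 3x3 kernel (K = 9).

    x       : (H, W) input feature map (single channel)
    w       : (3, 3) kernel weights
    offsets : (9, 2) learned offsets dp_k, one (dy, dx) pair per location
    mods    : (9,)   learned modulation factors dm_k
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # default p_k
    out = 0.0
    for k, (dy, dx) in enumerate(grid):
        qy = p[0] + dy + offsets[k, 0]
        qx = p[1] + dx + offsets[k, 1]
        out += w.flat[k] * bilinear_sample(x, qy, qx) * mods[k]
    return out
```

In the actual network, the offsets and modulation factors are produced by extra convolution layers over the same input; `torchvision.ops.DeformConv2d` provides a practical batched implementation.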
DConvs are applied to replace all the convolution layers of the bottleneck blocks in both residual stages, termed DCN-Bottleneck (see Fig. 1(c)). In this way, the geometric transformation modeling capability of the backbone network is enhanced. Therefore, 3M-CDNet has an advantage in suppressing pseudo-changes and overcoming the adverse effects of scale variations of objects with various shapes.
2) Pixelwise Classifier. The pixelwise classifier adopts a plain design with only four convolution layers. To obtain a change map with the same spatial resolution as the input, 2-fold bilinear upsampling is applied after the first and last convolution layers. The classifier classifies the features extracted by the backbone network into two classes and predicts a change probability map through a sigmoid layer, of which the values lie in the range [0, 1]. Finally, the binary change map is generated by thresholding the probability map with a threshold of 0.5.

B. Bag of Tricks for Performance Improvements
We used a "bag of tricks" to gain considerable performance improvements. In particular, multilevel feature fusion plays an indispensable role in improving the accuracy of CD networks. Meanwhile, an intuitive path to good generalization capability comes from two aspects: (1) develop a lightweight model with few parameters and apply regularization; and (2) increase the diversity of the training samples.
1) Multilevel Feature Fusion Strategies. Previous works demonstrate that both high-level semantics and low-level detail information are important in CD. However, these studies rarely state clearly which feature fusion strategy is effective. This work compares three feature fusion strategies: (1) apply only the high-level feature maps, termed the one-level strategy; (2) apply the fusion feature maps obtained by concatenating the high-level and low-level feature maps along the channel axis, termed the two-level strategy; and (3) apply the two-level strategy and then an extra fusion feature map, obtained by concatenating the output feature maps of the first convolution layer of the classifier with those extracted by the stem, termed the three-level strategy. Although channel attention modules, e.g., CBAM [15], could be applied for further improvements, the concatenation operation was selected for its simplicity, achieving high computational efficiency with a minimal number of parameters.
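The two-level concatenation can be sketched as follows, assuming for illustration channel-first (C, H, W) maps in which the high-level map has half the spatial resolution of the low-level one (the actual ratio depends on the backbone):

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x spatial upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def two_level_fusion(low, high):
    """Two-level strategy: upsample the high-level map to the low-level
    resolution and concatenate both maps along the channel axis."""
    return np.concatenate([low, upsample2x(high)], axis=0)
```

The three-level strategy would concatenate one more map in the same way; the one-level strategy simply skips the concatenation.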
2) Dropout Regularization [13]. Dropout is a simple yet effective way to prevent neural networks from overfitting.
During training, Dropout randomly drops units from the network with a certain probability p, which is equivalent to training numerous different networks simultaneously, i.e., ŷ = m ⊙ y, where m indicates a binary mask of the same size as the feature map y, and ⊙ indicates the element-wise multiplication operation. The mask m is randomly drawn from a Bernoulli distribution, in which each unit is zeroed with probability p, and the units of the feature maps corresponding to the locations of zeros are discarded during training. At test time, every unit is always present, and the weights are multiplied by (1 − p) so that the output of each unit matches its expected output at training time, i.e., E[ŷ] = (1 − p) · y. As shown in Fig. 1(b), two Dropout layers with probabilities of 0.5 and 0.1 are applied at the tail ends of the classifier's convolution layers.

3) Online Data Augmentation. Data augmentation (DA) is a simple yet effective technique for regularizing the network. DA can be used to simulate scale variations, illumination variations, and pseudo-changes, such as the spectral changes between bitemporal images. Online DA performs augmentation on the fly during training instead of expanding the training set in advance. Online DA was randomly applied over every batch with a probability of 0.8 by randomly shifting-rotating-scaling with zero padding, rotating by fixed angles, flipping in the horizontal and vertical directions, and applying color jitter. Each kind of augmentation was applied with a probability of 0.5.
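The batch-wise sampling logic of Online DA can be sketched as follows (flips only; shift/rotate/scale and color jitter are omitted for brevity, and this is an illustration rather than the authors' pipeline):

```python
import numpy as np

def online_augment(batch, rng, p_apply=0.8, p_each=0.5):
    """Online DA sketch for a (N, C, H, W) batch: with probability p_apply,
    independently apply each augmentation with probability p_each.
    Both bitemporal images and the label must receive the same transform,
    so in practice the sampled decisions are shared across them."""
    if rng.random() >= p_apply:
        return batch               # leave this batch unchanged
    if rng.random() < p_each:
        batch = batch[..., ::-1, :]  # vertical flip (H axis)
    if rng.random() < p_each:
        batch = batch[..., :, ::-1]  # horizontal flip (W axis)
    return batch
```

Because the augmentation happens per batch at training time, the stored dataset never grows, yet every epoch sees differently transformed samples.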

C. Loss Function Definition
During training, the network parameters are iteratively updated by minimizing the loss between the forward output of 3M-CDNet and the reference change map with the back-propagation (BP) algorithm according to a specific loss function. Since CD aims to classify pixels as changed or unchanged, the binary cross-entropy (BCE) loss is a natural candidate. However, due to the widespread class imbalance, the dominant unchanged pixels would make models tend to collapse. Thus, a soft Jaccard term is introduced, and the loss function can be formulated as follows:

L = λ · L_BCE + (1 − λ) · L_Jac,   (3)

where the two terms indicate the BCE loss and the soft Jaccard loss with the weights λ and (1 − λ), respectively, and λ is empirically set to 0.7. N is the number of training samples, and ŷ_i and y_i are the pixels of the predicted change map and the reference, respectively.
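A minimal numpy sketch of one common formulation of this combined loss, assuming the soft Jaccard term enters as 1 − J with weight (1 − λ) and λ = 0.7 (the authors' exact weighting may differ):

```python
import numpy as np

def bce_soft_jaccard_loss(pred, target, lam=0.7, eps=1e-7):
    """Weighted sum of BCE and a soft (differentiable) Jaccard term.

    pred   : predicted change probabilities in (0, 1)
    target : binary reference change map
    lam    : empirical weight on the BCE term (0.7 in the paper)
    """
    pred = np.clip(pred, eps, 1.0 - eps)
    bce = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    inter = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - inter
    soft_jaccard = (inter + eps) / (union + eps)
    return lam * bce + (1.0 - lam) * (1.0 - soft_jaccard)
```

The soft Jaccard term acts directly on the overlap between the predicted and reference changed regions, so the loss stays informative even when changed pixels are a small minority.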

A. Evaluation Metrics and Experiment Settings
The most common metrics related to the changed category were adopted for the quantitative evaluation, including overall accuracy (OA), precision (Pr), recall (Re), F1-score (F1), and intersection over union (IoU). F1 and IoU are comprehensive indicators; the higher the value, the better the performance.
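For reference, these metrics can be computed from the confusion counts of the changed class as follows (a standard sketch, not the authors' evaluation code; degenerate all-negative cases would need guarding against division by zero):

```python
import numpy as np

def cd_metrics(pred, ref):
    """Compute OA, Pr, Re, F1, and IoU for the changed class from
    binary prediction and reference change maps."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    tp = np.sum(pred & ref)      # changed pixels correctly detected
    fp = np.sum(pred & ~ref)     # false alarms
    fn = np.sum(~pred & ref)     # missed changes
    tn = np.sum(~pred & ~ref)    # unchanged pixels correctly rejected
    oa = (tp + tn) / (tp + tn + fp + fn)
    pr = tp / (tp + fp)
    re = tp / (tp + fn)
    f1 = 2 * pr * re / (pr + re)
    iou = tp / (tp + fp + fn)
    return oa, pr, re, f1, iou
```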
The proposed 3M-CDNet was implemented using the PyTorch framework and optimized by minimizing Eq. (3) with the AdamW optimizer [16], of which the initial learning rate and weight decay were set to 0.000125 and 0.0005, respectively. The minibatch size was set to 16 on an NVIDIA RTX 3090 GPU with 24 GB of memory. For evaluation, we adopted three representative datasets, i.e., LEVIR-CD [6], Season-Varying [11], and DSIFN [8]; we split the datasets following the criteria recommended by their creators (see Table I).

B. Ablation Studies
Ablation studies have been conducted to verify the contribution of each component of 3M-CDNet. Table II presents the quantitative results on several public datasets, where "w/o" and "w/" mean "without" and "with", respectively.
1) Effects of Multilevel Feature Fusion Strategies. The first three rows of Table II, in which all other components are applied, were used to verify the contributions of the three multilevel feature fusion strategies. Table II suggests that the two-level strategy always achieves the best performance in terms of F1, i.e., LEVIR-CD (0.9161), Season-Varying (0.9471), and DSIFN (0.7031). Compared with the one-level strategy, which lacks low-level details, it increased IoU and F1 significantly by about 2.09% and 1.24% on LEVIR-CD, 2.75% and 1.55% on Season-Varying, as well as 4.96% and 4.31% on DSIFN, respectively. Unfortunately, the impact of the three-level strategy is negligible when larger training samples are used, as on LEVIR-CD and DSIFN. However, the three-level strategy helps achieve improvements in F1 (1.01%) and IoU (1.79%) over the one-level strategy on Season-Varying, because it provides more spatial details when smaller training samples are used. We conclude that the two-level strategy is sufficient for improvements in our case, while introducing either insufficient or excessive features could bring about an unexpected degradation problem.
2) Effects of Different Tricks. The last four rows of Table II, all based on the two-level strategy, were used to verify the effectiveness of Online DA, Dropout, and DConv. Table II suggests that Online DA makes the most impressive contributions on the smaller datasets, i.e., LEVIR-CD and DSIFN. Compared with the situation without DA, applying DA achieves considerable improvements in F1 (1.43%) and IoU (2.40%) on LEVIR-CD. A similar situation can be observed on DSIFN. This demonstrates that Online DA is an effective trick to achieve immediate gains by improving the diversity of samples, especially when training samples are scarce. In contrast, its impact is negligible on the Season-Varying dataset, which has sufficient samples. Meanwhile, Dropout serves as an effective regularization complementary to Online DA for achieving a good generalization capacity. In addition, Table II shows that DConv is an indispensable component for achieving high accuracy. For instance, the last row shows that performance drops significantly without DConv: F1 decreased by about 1%, 2.03%, and 2.54% on the three datasets, and IoU decreased by 1.69%, 3.60%, and 2.97%, respectively. In particular, DConv helps improve the generalization capacity in the challenging case of DSIFN, in which training and testing samples are collected from different cities. Table III shows the computational costs (GFLOPs) and the number of parameters (M) of the CD networks. The computational cost during testing was calculated with fixed-size inputs; 3M-CDNet costs 23.66 and 94.64 GFLOPs for the two input sizes, respectively.

C. Comparisons with Other Approaches

1) Comparisons of Network Parameters and Computational Cost.
2) Comparisons on LEVIR-CD. W-Net [3], FC-EF-Res [4], Peng et al. [5], and the attention-based methods STANet [6], DDCNN [7], and FarSeg [17] were selected as benchmarks. In particular, STANet was proposed by the dataset's creators. Table IV presents the quantitative results and suggests that 3M-CDNet outperforms the state-of-the-art FarSeg/DDCNN (w/ Online DA) and achieves the best performance in terms of IoU (0.8452) and F1 (0.9161), which increased by about 1.66% and 0.98% compared to FarSeg, respectively. Moreover, 3M-CDNet achieves a better trade-off between precision and recall than the other approaches, setting a new benchmark on the LEVIR-CD dataset. For intuitive comparisons, the visual results are shown in Fig. 2. Compared with the change maps generated by other methods, 3M-CDNet achieves more complete and accurate boundaries and higher internal compactness on objects with various scales and shapes, which is more consistent with the reference. Moreover, 3M-CDNet successfully identifies the tiny gaps among crowded building groups and overcomes the pseudo-changes caused by spectral changes, whereas the other methods suffer from more false alarms.
3) Comparisons on Season-Varying. Further attention-based methods, IFN [8], BA2Net [9], and DASNet [10], were selected as benchmarks for the quantitative evaluation. Table V suggests that 3M-CDNet is consistently superior to the other approaches. Fig. 3 shows that even in the challenging case of large seasonal variations (summer-to-winter/autumn), 3M-CDNet accurately identifies changed objects of various scales and appearances, including road changes and building changes.

IV. CONCLUSIONS
The universal network 3M-CDNet is proposed for CD. The lightweight 3M-CDNet exhibits a good generalization capacity by incorporating effective tricks, i.e., Online DA and Dropout regularization. The flexible 3M-CDNet achieves accuracy improvements in two ways: one is applying DConv in the backbone network to gain a good geometric transformation modeling capacity for objects with various scales and appearances; the other is applying a two-level feature fusion strategy to enhance the feature representation capacity. Experimental results show that 3M-CDNet outperforms state-of-the-art methods and serves as a new benchmark on three public datasets.