STransFuse: Fusing Swin Transformer and Convolutional Neural Network for Remote Sensing Image Semantic Segmentation

Abstract—The convolutional neural network (CNN) has bolstered the application of remote sensing images. However, the fixed size of the receptive field limits the CNN's ability in semantic modelling. Although the Transformer model based on self-attention can model global semantic information, it remains a challenge for it to work effectively in the remote sensing field, where data are limited. Leveraging the benefits of both the Transformer and the CNN, a new semantic segmentation model for remote sensing images is proposed in this article. It is constructed by fusing the Swin Transformer and a CNN and is thus named the STransFuse model. On the one hand, the STransFuse model makes use of the capability of the Transformer network to model global semantic information in remote sensing images. On the other hand, it utilizes a CNN with pre-trained weights to overcome the low performance of the Transformer network when the amount of remote sensing image data is small, and to provide the contextual location information of the image. An Adaptive Fusion Module (AFM) is created to adaptively fuse the feature maps produced by the Transformer network and the CNN to improve the model's feature representation capacity. The overall accuracy (OA) of the STransFuse model is 1.36% higher than the baseline on the Vaihingen dataset, and 1.27% higher than the baseline on the Potsdam dataset. When compared with other state-of-the-art models, the STransFuse model performs competitively.


I. INTRODUCTION
CNN has performed admirably in the realm of computer vision. Fully convolutional networks (FCNs) [1] based on the CNN have become the most popular image segmentation architecture. However, the scale of features contained in remote sensing images varies greatly: large ground objects (e.g., buildings) occupy a large proportion of an image, and FCNs cannot acquire the contextual information of the image well due to the limitation of the fixed receptive field of the convolution kernel. To address this issue, CNN-based models employ pooling to reduce the resolution of the feature map and obtain a global representation of feature information. However, global pooling causes the model to lose information about small targets in the image.
Some researchers [2]–[4] have attempted to tackle the above challenges by fusing multi-scale contextual information. By merging atrous convolutions to gather multi-scale contextual information, Chen et al. [3] improved the Atrous Spatial Pyramid Pooling (ASPP) module. To improve the expression of feature map information, U-Net [5] uses an encoder-decoder structure to retrieve feature map information at different levels via skip connections. To mine global contextual information, the Pyramid Scene Parsing Network (PSPNet) [4] gathers contextual information from different regions through its pyramid pooling module.
Furthermore, several researchers [6], [7] attempted to tackle the problem of limited network receptive fields by utilizing self-attention [8]. To model semantic significance in the spatial and channel dimensions, Fu et al. [6] proposed a Compact Position Attention Module and a Compact Channel Attention Module based on the self-attention mechanism. In the field of computer vision, the Transformer [9]–[13], based on sequence-to-sequence prediction, has demonstrated exceptional performance. The Transformer's structure forgoes the convolution operation in favor of a pure attention mechanism. Unlike the CNN, the Transformer can obtain global context information through self-attention. Experiments [9] demonstrated that the Transformer network can achieve high performance in image tasks such as image classification, image recognition, and semantic segmentation when large-scale pre-training is conducted. In this article, we explore the application potential of the Transformer for semantic segmentation of remote sensing images. We tried various Transformer networks that had produced excellent results on public datasets on remote sensing images and discovered, surprisingly, that they could not deliver sufficient results. This is because when an image is supplied to the Transformer network, each image patch is compressed into a 1D sequence. Through self-attention, the Transformer network concentrates on the image's global semantic context, but it loses the image's spatial context (location information) during computation. The spatial context cannot be recovered well by up-sampling in the decoder stage of the Transformer network, resulting in poor image segmentation. Inspired by U-Net [5], we merged the feature maps of distinct stages to acquire both the semantic context and the spatial context of the image.
To this end, we propose a new model for semantic segmentation of remote sensing images, named STransFuse, which is constructed by combining the Swin Transformer architecture with a CNN. The Swin Transformer acquires features in the form of shifted windows to establish self-attention, while the CNN acquires spatial context information. The Transformer's success is predicated on training with extensive data. However, the image datasets available in the field of remote sensing are limited, which severely restricts the Transformer's application in this field. Inspired by [9], Resnet34 with pre-trained weights is employed as the network backbone of the CNN branch and paired with the Swin Transformer to acquire rich feature information from remote sensing images. Our main contributions are as follows:
1) A model combining the Swin Transformer and Resnet34 is designed to incorporate the global semantic information retrieved by the Transformer network with the spatial contextual information extracted from the images by Resnet34. Using Resnet34 with pre-trained weights solves the problem of degraded Transformer performance caused by small remote sensing datasets.
2) An AFM is created to adaptively fuse feature maps using the self-attention mechanism and improve the model's ability to express features.
3) On the Vaihingen and Potsdam datasets, the proposed STransFuse model performs better.
The remainder of this article is structured as follows. Related work on semantic segmentation of remote sensing images is discussed in Section II, where some Transformer research is also reviewed. In Section III, the STransFuse framework and the AFM design are described. The datasets used in the studies, as well as the experimental parameters, are described in Section IV. Section V presents a complete ablation study and an experimental comparison between the STransFuse model and some state-of-the-art models to validate the proposed module.
The conclusion is given in Section VI.

A. Semantic Segmentation of Remote Sensing Images
Remote sensing images are widely used in many application fields, including crop yield estimation [14], military reconnaissance and natural disaster monitoring [15]. The accuracy of these applications is largely determined by the segmentation accuracy of remote sensing images. Traditional remote sensing image semantic segmentation relies on the texture information and spectral information of images, which requires a lot of manpower and material resources. The introduction of deep learning into remote sensing image segmentation has increased the accuracy, resulting in a significant increase in image segmentation efficiency. Lin et al. [16] created a scale-aware module to let the network distinguish different features using weighted feature maps. Chong et al. [17] proposed the Context Union Edge Network for semantic segmentation of remote sensing images, and a context-based feature augmentation module to improve CNN's capacity to differentiate small targets as well as a dual-stream network to refine small target edge information. Xiang et al. [18] created an adaptive feature selection module that learns the weight contribution of each feature block at different scales to improve the network's performance. Li et al. [19] introduced a semantic boundary aware network to collect correct boundary information for land cover categorization. The network is adaptable to obtain image boundary information using a bottom-up method and can reduce noise information from low-level characteristics. AFNet [20] employed the scale-feature attention module and scale-layer attention module to better tackle the difference between intra-class and inter-class in remote sensing images, and conducted adaptive feature improvement for targets of various sizes. Pan et al. [21] introduced a conditional generative adversarial network that actively generates new sample images while extracting advanced spatial information from previous training images. 
The network achieved greater classification accuracy using this strategy. To overcome the problem of cloud segmentation in remote sensing images and increase the network's feature extraction ability, Yao et al. [22] presented a multi-scale feature extraction and content-aware recombination network. The spatial relation module and the channel relation module proposed by Mou et al. [23] could learn and infer the relationship between any two geographical locations or feature maps to produce effective contextual spatial relation modeling.

B. Contextual Information
To increase the accuracy of image semantic segmentation, it is critical to properly extract the image's contextual information. FCNs [1] first widened the receptive field by pooling to capture the image's context information, but multiple downsampling operations cause the feature map to lose certain details. U-Net [5] created a network framework with an encoder-decoder structure that allows detailed information from low-level feature maps to be merged into high-level feature maps by skipping network layers. The Feature Pyramid Transformer (FPT) [11] is a fully active feature interaction that extends the receptive field through the designed Transformer. Chen et al. [24] developed a tensor generation module to capture contextual data and offered a new way of modeling 3D context representations. The spatial relation module and the channel relation module were introduced by Mou et al. [25] to learn and infer the global link between any two spatial positions or feature maps, and then build a feature representation with improved relational information. In [26], an axial-attention model was presented to widen the receptive field of the model and alleviate the loss of long-range context information in convolution. For remote sensing images, context information indicates the relationship between features. Compared with ordinary images, the difficulty of obtaining context information stems from the high resolution and the imbalanced proportions of the various characteristics reflecting different ground objects. It is generally difficult to analyze remote sensing images directly, and handling them frequently requires preprocessing (cropping, normalization). Some methods based on self-attention mechanisms waste computing resources on obtaining context relationships when the processed image patch contains only one category of ground object.
When a common model executes the convolution operation, the proportion of large-scale features in a patch can be substantially higher than that of small-scale features, causing small-scale features to be heavily influenced by large-scale features. How to efficiently resolve intra-class and inter-class disparities in remote sensing images while balancing the accuracy and efficiency of remote sensing image processing is a hot issue in this research field.

C. Transformer
The Transformer was originally used in the realm of Natural Language Processing (NLP) [8]. It is a deep neural network model that extracts intrinsic properties via self-attention. The Transformer's strong experimental performance in NLP suggested that it may be applied to image processing. The first Transformer model based on pure self-attention for image recognition, the Vision Transformer (ViT) [9], has achieved outstanding results in image processing, but it requires very large datasets for training, and the results obtained by applying it directly to small or medium-sized datasets are not promising. A great number of researchers have tried many ways to make the Transformer more successful in the field of computer vision, inspired by the construction of the visual Transformer models [27]–[32]. The SEgmentation TRansformer (SETR) [33] is a model for semantic segmentation that uses a Transformer as the encoder. A sophisticated segmentation model can be created by combining the pure Transformer encoder with simple decoders. The DEtection TRansformer (DETR) [10] was developed by Facebook AI researchers and applied to vision models. It is the first target detection framework to successfully incorporate the Transformer as a pipeline's core building block. The DETR model performed well in target detection and panoptic segmentation. The Transformer-in-Transformer (TNT) model [34] makes use of an inner Transformer block to extract the internal structure information of image patches, allowing the model to extract both global and local properties. The model performed well on the ImageNet benchmark dataset and in various downstream tasks. The Shifted Windows Transformer (Swin Transformer) [35] is a hierarchical Transformer that, like a CNN, is capable of increasing the receptive field of nodes as the network layers deepen.
The use of shifted windows allows self-attention to be computed in non-overlapping local windows, reducing the computational complexity, which is otherwise quadratic in image size, and thus potentially lowering the hardware requirements for tasks that need dense pixel-level prediction (e.g., semantic segmentation). On datasets including ImageNet-1K and ADE20K, the Swin Transformer achieved good results. The experimental results, however, are unsatisfactory when the model is applied to the field of remote sensing, because remote sensing image datasets are small and the features of remote sensing images differ markedly from those of ordinary images. Inspired by the ViT model [9], we combined a pre-trained Resnet34 as the CNN backbone with the Swin Transformer model to create a two-branch network model that performs well on remote sensing images.

A. Overview
The input remote sensing image is x ∈ ℝ^(H×W×C), where H is the height of the image, W is the width, and C is the number of channels. We use the Swin Transformer and Resnet34 to process the image, fuse the feature maps at different stages, and finally restore the feature maps to the original size. In Paragraph B, we introduce the overall structure of STransFuse. Then, the details of the Swin Transformer are given in Paragraph C. Finally, the AFM is described in Paragraph D.

B. STransFuse Overall Architecture
As shown in Fig. 1(a), the image x is input into the Swin Transformer network and the Resnet34 network respectively. The Swin Transformer network has four stages, which yield the feature maps x_s1, x_s2, x_s3, and x_s4 respectively, and each stage contains Patch Merging and Swin Transformer blocks. Patch Merging works similarly to a CNN pooling layer in that it downsamples the image. By shifting the input image's window, this module separates the image into non-overlapping patches, where each patch is treated as a "token". We initially fix the patch size to 4×4. Then, the values in the feature map are projected to dimension C through a linear embedding layer. Finally, Swin Transformer blocks are applied to these patch tokens (resolution H/4 × W/4). The steps above are collectively referred to as "Stage 1". In the following "Stage 2", Patch Merging concatenates the features of each group of 2 × 2 neighboring patches, applies a linear embedding layer to change the output dimension to 2C, and applies Swin Transformer blocks for feature transformation. In "Stage 2", the resolution of the patches is maintained at H/8 × W/8. "Stage 3" and "Stage 4" are similar to "Stage 2", and their output patch resolutions are H/16 × W/16 and H/32 × W/32 respectively. Finally, the fused feature map is upsampled twice more, and the feature map is restored to its original size.
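To make the stage geometry concrete, the resolution and channel arithmetic of the four stages can be sketched in a few lines of Python; the function name and the example input size below are ours, not from the paper:

```python
def swin_stage_shapes(H, W, C):
    """Token-grid size and channel width after each of the four Swin stages.

    Stage 1 embeds 4x4 patches into C channels (an H/4 x W/4 token grid);
    each subsequent Patch Merging halves the grid and doubles the width.
    """
    shapes = []
    h, w, c = H // 4, W // 4, C
    for stage in range(1, 5):
        shapes.append((stage, h, w, c))
        h, w, c = h // 2, w // 2, 2 * c
    return shapes

# A 256x256 input with embedding dimension C = 96 (an illustrative choice):
for stage, h, w, c in swin_stage_shapes(256, 256, 96):
    print(f"Stage {stage}: {h}x{w} tokens, {c} channels")
```

For a 256×256 input this prints the familiar 64×64 → 8×8 pyramid, which is the set of resolutions at which the CNN and Transformer feature maps are later fused.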

C. Swin Transformer Block
The Swin Transformer replaces the Multi-head Self Attention (MSA) of the standard Transformer module with Window Multi-head Self Attention (W-MSA). As shown in Fig. 1(b), the Swin Transformer inputs the feature map processed by Patch Merging into the Swin Transformer block. The feature map then passes through a LayerNorm layer into the W-MSA module, and there is a residual connection around each module and another LayerNorm layer.
The self-attention used in the standard Transformer block is calculated by relating each token to all other tokens. This makes the computation of the network grow quadratically with the resolution of the image, and for dense prediction tasks (e.g., semantic segmentation) the model will require high-end computing devices. The Swin Transformer instead performs the self-attention computation within local windows, where each window contains M×M patches split in a non-overlapping manner. In this case, the computational complexity of MSA is shown in (1), and the computational complexity of W-MSA is shown in (2):

Ω(MSA) = 4hwC² + 2(hw)²C    (1)
Ω(W-MSA) = 4hwC² + 2M²hwC   (2)
Here h and w are the height and width of the feature map, respectively, and C is the channel dimension. In (1), the computational complexity of MSA is quadratic in the product of h and w. In (2), when M is fixed (set to 7 by default), the computational complexity of W-MSA is linear in the product of h and w. Compared with W-MSA, Shifted Window Multi-head Self Attention (SW-MSA) shifts the windows. Because W-MSA partitions the feature map into regular, non-overlapping windows, there is no interaction between windows, which limits the performance of the Swin Transformer. Therefore, in order to realize cross-window connections in the model, [35] introduced SW-MSA.
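Plugging numbers into (1) and (2) shows how much window partitioning saves. The sketch below simply evaluates the two complexity formulas; the token-grid size and channel width in the example are illustrative choices of ours:

```python
def msa_flops(h, w, C):
    # Eq. (1): global self-attention is quadratic in the token count h*w.
    return 4 * h * w * C ** 2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M=7):
    # Eq. (2): window attention is linear in h*w for a fixed window size M.
    return 4 * h * w * C ** 2 + 2 * M ** 2 * h * w * C

# On a 64x64 token grid with C = 96, W-MSA is over an order of magnitude cheaper:
print(msa_flops(64, 64, 96) / wmsa_flops(64, 64, 96))
```

The gap widens further as the grid grows, which is exactly why window attention matters for dense pixel-level prediction.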
In general, the calculation process of the feature map in two successive Swin Transformer blocks is shown in (3)-(6):

x̂^l = W-MSA(LN(x^(l-1))) + x^(l-1)      (3)
x^l = MLP(LN(x̂^l)) + x̂^l                (4)
x̂^(l+1) = SW-MSA(LN(x^l)) + x^l         (5)
x^(l+1) = MLP(LN(x̂^(l+1))) + x̂^(l+1)    (6)

where x̂^l denotes the output features of the W-MSA module of block l, x̂^(l+1) denotes the output features of the SW-MSA module of block l+1, and x^l denotes the output of the MLP module of block l. W-MSA denotes window-based multi-head self-attention using the regular window partitioning configuration, SW-MSA denotes window-based multi-head self-attention using the shifted window partitioning configuration, LN denotes Layer Normalization, and MLP denotes the Multi Layer Perceptron.
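The residual wiring of (3)-(6) can be written down directly. The sketch below treats the sub-modules as interchangeable callables rather than real attention layers, so it only demonstrates the data flow, not a working Swin block:

```python
def swin_block_pair(x, w_msa, sw_msa, mlp, ln):
    """Data flow of eqs. (3)-(6): a W-MSA block followed by an SW-MSA block,
    each with pre-LayerNorm, an MLP, and residual connections."""
    xh = w_msa(ln(x)) + x      # (3)
    x = mlp(ln(xh)) + xh       # (4)
    xh = sw_msa(ln(x)) + x     # (5)
    x = mlp(ln(xh)) + xh       # (6)
    return x

# With identity stand-ins, each of the four residual steps doubles a scalar input:
identity = lambda v: v
print(swin_block_pair(1.0, identity, identity, identity, identity))  # 16.0
```

The same wiring applies at every stage; only the window partitioning inside w_msa and sw_msa differs.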

D. AFM
To efficiently fuse the encoded features from the CNN and the Swin Transformer, we designed an AFM based on the self-attention mechanism, whose structure is shown in Fig. 2. The features are fused as follows:

x_cs,i = ReLU(Conv(Interpolate(Concat(x_s,i, x_c,i))))     (7)
x_BN,i-1 = ReLU(BN(Conv(Concat(x_cs,i, x_s,i-1))))         (8)
x_q = Linear(Concat(AdaptiveAvgPool2d(x_BN,i-1)))          (9)
x_k = Linear(Concat(AdaptiveAvgPool2d(x_BN,i-1)))          (10)
x_v = Linear(Concat(AdaptiveAvgPool2d(x_BN,i-1)))          (11)

Here, x_s,i represents the feature matrix output by the i-th stage of the Swin Transformer, and x_c,i represents the feature matrix output by the i-th layer of the CNN. x_q, x_k, and x_v are the query, key, and value of the self-attention calculation, respectively; x_q x_k^T gives the self-attention weight matrix, (x_q x_k^T) x_v gives the weighted feature matrix, and the weighted feature matrix is added to the fused feature matrix to obtain x_s,i-1.
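The weighting step described above is ordinary dot-product attention. A dependency-free sketch on small matrices looks like this; all shapes and values are illustrative, and the AFM's learned Linear/pooling projections are assumed to have been applied already:

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, k, v):
    """softmax(q k^T) v over plain lists of rows: the AFM's weighting step.

    q, k, v are n x d matrices (lists of n rows of length d).
    """
    n = len(q)
    d = len(v[0])
    scores = [[sum(a * b for a, b in zip(q[r], k[c])) for c in range(n)]
              for r in range(n)]
    weights = [softmax(row) for row in scores]   # self-attention weight matrix
    return [[sum(weights[r][c] * v[c][j] for c in range(n)) for j in range(d)]
            for r in range(n)]

out = attention([[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]])
print(out)
```

In the AFM itself, the output of this step is then added back to the fused feature matrix as a residual.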

B. Evaluation Metric
We employed the data publisher's evaluation approach [36], which was also used in [19], [21], [27], [37]. We used the Intersection over Union (IoU) for each category, the F1-score for each category, the mean Intersection over Union (mIoU), the mean F1-score, and the overall accuracy (OA) as our evaluation indicators. Because many indicators are calculated from the confusion matrix, before introducing the formula of each indicator we define the following symbols of the confusion matrix: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). The precision rate is calculated using (13), and the recall rate using (14):

precision = TP / (TP + FP)    (13)
recall = TP / (TP + FN)       (14)

The definition of OA is shown in (15):

OA = (TP + TN) / (TP + TN + FP + FN)    (15)

The F1-score for each category is defined in (16):

F1 = 2 · (precision · recall) / (precision + recall)    (16)

The mean F1-score is obtained by averaging the F1-scores of all categories; the higher the F1-score, the better the experimental result. The definition of IoU is shown in (17):

IoU = |N_p ∩ N_gt| / |N_p ∪ N_gt|    (17)

where N_p represents the prediction set and N_gt represents the ground truth. The mIoU is calculated per class: with the IoU of each class, a global evaluation is obtained by averaging the IoUs.
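All five indicators follow directly from the per-class confusion counts. A minimal sketch in Python, with made-up example counts:

```python
def metrics(tp, tn, fp, fn):
    """precision, recall, OA, F1 and IoU from one class's confusion counts,
    using the standard confusion-matrix definitions."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    oa = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # For a hard segmentation, |Np ∩ Ngt| = TP and |Np ∪ Ngt| = TP + FP + FN.
    iou = tp / (tp + fp + fn)
    return precision, recall, oa, f1, iou

p, r, oa, f1, iou = metrics(tp=80, tn=890, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(oa, 3), round(f1, 3), round(iou, 3))
```

mIoU and the mean F1-score are then just the averages of the per-class iou and f1 values.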

C. Training Configuration
All the experiments were implemented using PyTorch 1.4.0, Python 3.7, CUDA 10.1, and CuDNN 7.6.5. The networks use the Adam optimizer with a weight decay of 0.0002. We adopted the "poly" learning rate policy with a power of 0.9. The weighted cross-entropy loss was defined as shown in (18):

L = -Σ_c w_c y_c log(p_c)    (18)

where y_c is the one-hot ground truth label for class c, p_c is the predicted probability, and w_c is the class weight. For all datasets, we set the batch size to 16 for all models except the TNT model and the Transunet model. Because the TNT model and the Transunet model are computationally expensive, in order to fit our GPU memory we set the batch size of these two models to 12. All experiments were measured on a single 2080Ti with 11 GB of memory.
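The "poly" policy decays the learning rate from its base value to zero over training. A one-line sketch, where the base rate and iteration counts in the example are ours:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """The 'poly' learning-rate policy: base_lr * (1 - iter/max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# Learning rate at the start, midpoint, and end of a 10000-iteration run:
for it in (0, 5000, 10000):
    print(it, poly_lr(1e-3, it, 10000))
```

With power 0.9 the decay is close to linear but slightly front-loaded, which keeps the rate higher through most of training than an exponential schedule would.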

V. EXPERIMENTS
We tested the effectiveness of the proposed module through ablation studies. Then, we compared the proposed STransFuse with some state-of-the-art methods and discussed the experimental results.

A. Ablation Studies
We conducted a series of ablation studies on the Vaihingen and Potsdam datasets to demonstrate the validity of the proposed model. We chose FCN [1] (Resnet34) as the baseline network for comparison. Table I summarizes the ablation results with different configurations of the network blocks. Among them, Swin xs4 means that only the Swin Transformer is used as the feature extractor and only the feature map output by Stage 4 is input into the decoder. Swin uses concat to fuse the feature maps output by all stages of the Swin Transformer and inputs the fused feature maps into the decoder. Res34+Swin means that Resnet34 processes the input image, and the extracted feature map is then input into the Swin Transformer. Swin+Res34 is a two-branch network model built by fusing the Swin Transformer network and Resnet34; this fusion model uses concat to fuse the feature maps of the different stages of the Swin Transformer network and Resnet34. STransFuse also uses the dual-branch structure as the feature extractor, but at the different feature map stages we use our AFM instead of the concat module to fuse the feature maps.
It is seen from Table I that the STransFuse model produced the best results. We first examined the impact of single-branch versus double-branch network models. Table I shows that the single-branch models (FCN, Swin xs4, Swin, Res34+Swin) perform worse for semantic segmentation than the two-branch models (Swin+Res34, STransFuse). At the same time, we compared the experimental results of using only the features (xs4) output by the final stage of the Swin Transformer against fusing the features of different stages (xs1, xs2, xs3, xs4). The results demonstrate that the Swin model, which combines the feature maps of the several stages of the Swin Transformer, improves OA by 0.49%. Then, we compared the experimental effects of connecting the Resnet34 network with pre-trained weights to the Transformer model in series (Res34+Swin) and in parallel (Swin+Res34, STransFuse). As seen in Table I, the parallel network model performs better: Swin+Res34 is 1.19% better than Res34+Swin in terms of OA. Finally, we compared the performance of the AFM and concat modules. As shown in Table I, the model employing our proposed AFM for feature map fusion performs 0.24% better in OA than the model using concat. By comparing the Swin Transformer's trial findings, it is observed that our model can solve the problem of the Swin Transformer's inability to distinguish small targets. This is because, when the Transformer network processes the image, it stretches each patch into a one-dimensional token. Under the influence of the surrounding large-target pixels, the pixels of a tiny target are separated into locations far apart, and the features of small targets become less visible. The STransFuse model can learn features from both semantic and spatial context information, which helps to tackle the Transformer's inability to learn small target features.
It can be seen clearly from Fig. 3 that the STransFuse model segments better than the baseline network FCNs, and that the STransFuse model does not misclassify the features with shading effects in row (b). Comparing Swin xs4 and Swin demonstrates that combining the feature maps of the several stages of the Swin Transformer is more effective than using only single-stage feature maps, and comparing the visualization maps of Res34+Swin and Swin+Res34 in row (b) shows that the two-branch network model has superior segmentation performance for buildings than the single-branch network model.
We compared the recognition capabilities of the benchmark FCNs model and the STransFuse model for different categories of ground objects by visualizing the last Score layer of each model. The highlighted (red) areas in the figure depict regions the network focuses on, whereas the dark (dark blue) areas reflect regions the model does not focus on, as seen in Fig. 4. Comparing the CAMs of FCNs and STransFuse shows that the STransFuse model can better detect different sorts of targets in the Vaihingen dataset. In the building column, our STransFuse model classifies buildings more accurately. Because the ground objects in the images are captured from above, their height information is absent, so the texture representations of building roofs and impervious surfaces are comparable. Therefore, the FCNs model exhibited the phenomenon of "car flying on the roof" in the recognition image. However, thanks to self-attention, the STransFuse model modelled the long-range semantic correlation and determined the category information of similar semantics, and can therefore recognize semantic information better. In the column of the category car, FCNs did not identify all car features and were not accurate on the already identified car boundaries, whereas the STransFuse model is also good at identifying small targets like cars. In the impervious surface column, it is shown that FCNs recognized some car semantics as impervious surface. This is because a car occupies a smaller proportion of the image than the impervious surface, the impervious surface encloses the car, and there is no correlation between one car and another.
This is a common inter-class imbalance in remote sensing images, which occurs because such images often span a wide range of locations: larger objects can fill a larger proportion of the image, whereas smaller-scale elements occupy only a small number of pixels. FCNs rely on a fixed-size convolution kernel to obtain features, so when extracting such small-scale features they are easily affected by the surrounding feature categories. The Transformer branch we use can effectively solve this type of problem. The images show that low vegetation and tree have similar feature information, and in the absence of image height information it is easy to misclassify them. Compared to FCNs, the STransFuse model better distinguishes these two features. Table II shows the results of the comparative experiments. All convolutional network models use Resnet34's pre-trained weights. It can be seen from the table that the STransFuse model achieves the best results. Although Deeplabv3+ [38] produced impressive results, the network uses a lot of GPU memory during training due to the ASPP of the model architecture, and Deeplabv3+ has the longest training time of all the compared models, as seen in Fig. 10; the Deeplabv3+ model's overall efficiency is thus low. The Scale-Aware Network (SANet) [16] is a network model evaluated on the same dataset. This model designed a re-sampling module that implicitly introduces spatial attention through re-sampling feature maps. Experimental comparison shows that our model is better than SANet: the mean F1-score increased by 5.32%. BoTNet [39] replaced the last three bottleneck blocks in Resnet with a global attention module, implicitly regarded as multi-head attention in the author's model design. On the Vaihingen dataset, this model performed reasonably well.
Due to the limited amount of remote sensing image data, the TNT model [34] produced poor experimental results.

B. Evaluation and Comparisons on the Vaihingen Dataset
The Transunet model [40] used the Transformer as an encoder to model long-range dependencies and added low-level detail information to the feature maps in the decoder via skip connections. However, due to the design of the encoder and the skip connections, the Transunet model has higher hardware requirements. Comparing the experimental results of the CNN-based improved models (FCN, Deeplabv3+, Unet, SANet, PSPNet) and the Transformer-based improved models (BoTNet, SETR-PUP, TNT, Transunet), the STransFuse model achieved better performance.
On the Vaihingen dataset, the qualitative comparison results are displayed in Fig. 6. As shown in Fig. 6, STransFuse is capable of recognizing a variety of target categories. It benefits from the Transformer network's capabilities, such as improved global context modeling without sacrificing low-level localization capability, and the ability to more precisely recognize small-scale objects (car). In the other networks, the car is predicted as a fuzzy area and the car boundary cannot be reliably determined. Furthermore, we discovered that the models with a pure Transformer encoder (SETR-PUP, TNT) mistakenly recognized building as impervious surface. The reason for this phenomenon may be that the two categories of features, building and impervious surface, have similar characteristics: the Transformer obtains similar building and impervious surface feature values when stretching the patch into a 1D token, and when calculating the similarity, the self-attention judges the two types of features as the same type. When the characteristics are very different, the Transformer can distinguish them.
The results of testing the original image on the Vaihingen dataset are shown in Fig. 7. The STransFuse model can better identify large-scale buildings and accurately identify the boundaries of the various features. Table III shows that the STransFuse model obtains the highest overall accuracy score on the Potsdam dataset, and the second highest scores for the mean Intersection over Union and mean F1-score indicators.

C. Evaluation and Comparisons on the Potsdam Dataset
The results of the qualitative comparison of the models are displayed in Fig. 8, where it can be seen that the STransFuse model performs well for features of various sizes. The STransFuse model determines more precisely the borders of small-scale cars, discriminates trees from low vegetation better than the previous models, and reliably determines the boundaries of buildings with huge dimensions. As a result, the STransFuse model is able to recognize multi-scale remote sensing images with high accuracy. Fig. 9 shows the results of testing the original images from the Potsdam dataset; the STransFuse model shows stronger capability in feature identification at various scales and in feature boundary identification. Fig. 10(a) shows a comparative plot of the efficiency of the different models on the Vaihingen data.

TABLE II: The comparison of STransFuse with some state-of-the-art models using the Vaihingen dataset. The value in bold is the best, and the underlined value is the second best.

It can be seen that the STransFuse model improves OA with only a small increase in training time. Because of the ASPP module, the Deeplabv3+ model takes a longer time in training and is thus less efficient. It is also seen from Fig. 10 that when applied directly to the semantic segmentation of remote sensing images, the performance of the improved-Transformer-based models is low. The experimental efficiency of the different models on Potsdam is shown in Fig. 10(b), from which it can be observed that the STransFuse model reaches a better OA in a shorter period of time.

VI. CONCLUSION
In this article, the STransFuse model, constructed by fusing the Swin Transformer and a CNN, is proposed. This two-branch model takes advantage of both the Transformer network and the CNN: the Transformer can model the global correlation of the input patches, while the CNN with pre-trained weights can make up for the shortage of remote sensing training data. The proposed model structure makes full use of the feature information and detail information in all stages of the feature maps to generate outstanding feature representations. In addition, we created an AFM that adaptively fuses features from the Transformer and CNN branches, so that the feature map input to the decoder incorporates rich semantic and spatial context information. In comparison with some state-of-the-art models on the Vaihingen and Potsdam datasets, the STransFuse model is competitive. In the future, we will continue to research the Transformer's applicability in remote sensing image processing to explore its potential.