Multimodal Semantic Consistency-Based Fusion Architecture Search for Land Cover Classification
Multimodal Land Cover Classification (MLCC) using optical and Synthetic Aperture Radar (SAR) modalities has achieved outstanding performance compared with using unimodal data alone, owing to their complementary information on land properties. Previous multimodal deep learning (MDL) methods have relied on handcrafted multi-branch convolutional neural networks (CNNs) to extract features from the different modalities and merge them for land cover classification. However, handcrafted CNN models designed for natural images may not be optimal for Remote Sensing (RS) image interpretation, due to the large differences in imaging angles and imaging mechanisms. Furthermore, few MDL methods have analyzed optimal combinations of hierarchical features from different modalities. In this article, we propose an efficient multimodal architecture search framework, namely Multimodal Semantic Consistency-Based Fusion Architecture Search (M2SC-FAS), which operates in a continuous search space with gradient-based optimization. It not only discovers optimal optical- and SAR-specific architectures according to the distinct characteristics of optical and SAR images, respectively, but also searches for an optimal multimodal dense fusion architecture. Specifically, a semantic-consistency constraint is introduced to guarantee dense fusion between hierarchical optical and SAR features with high semantic consistency, thereby capturing complementary information on land properties. Finally, a curriculum learning strategy is adopted for M2SC-FAS. Extensive experiments demonstrate the superior performance of our method on three large co-registered optical and SAR datasets.
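The gradient-based search in a continuous space mentioned above typically works by relaxing the discrete choice among candidate operations into a softmax-weighted mixture, so that architecture weights can be optimized by gradient descent alongside network weights. The following is a minimal, generic sketch of that continuous relaxation (not the authors' implementation); the toy operations standing in for convolution, pooling, and skip connections are purely illustrative assumptions.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over architecture weights."""
    e = np.exp(a - a.max())
    return e / e.sum()

# Hypothetical candidate operations (simple stand-ins for the conv/pool/skip
# blocks a real search space would contain).
ops = [
    lambda x: x,        # identity (skip connection)
    lambda x: 0.5 * x,  # stand-in for a pooling op
    lambda x: x ** 2,   # stand-in for a conv block
]

def mixed_op(x, alpha):
    """Continuous relaxation: output is the softmax(alpha)-weighted
    sum of all candidate operations applied to x. Gradients w.r.t.
    alpha then drive the architecture search."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, ops))

alpha = np.zeros(3)                 # uniform weights at initialization
y = mixed_op(np.array([2.0]), alpha)
# uniform mixture of 2.0, 1.0, and 4.0 -> 7/3
```

After optimization, the operation with the largest weight in `alpha` is kept at each edge, yielding a discrete architecture; in a multimodal setting, a separate set of such weights would govern each modality-specific branch and the fusion connections.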