Multimodal Semantic Consistency-Based Fusion Architecture Search for
Land Cover Classification
Abstract
Multimodal Land Cover Classification (MLCC) using the optical and
Synthetic Aperture Radar (SAR) modalities has achieved outstanding
performance compared with using unimodal data alone, due to their
complementary information on land properties. Previous multimodal deep
learning (MDL) methods have relied on handcrafted multi-branch
convolutional neural networks (CNNs) to extract the features of
different modalities and merge them for land cover classification.
However, handcrafted CNN models oriented toward natural images may not
be the optimal strategy for Remote Sensing (RS) image interpretation
problems, due to the huge differences in imaging angles and imaging
mechanisms.
Furthermore, few MDL methods have analyzed optimal combinations of
hierarchical features from different modalities. In this article, we
propose an efficient multimodal architecture search framework, namely
Multimodal Semantic Consistency-Based Fusion Architecture Search
(M2SC-FAS), which searches a continuous space with a gradient-based
optimization method. It can not only discover optimal optical- and
SAR-specific architectures according to the different characteristics
of the optical and SAR images, respectively, but also search for the
optimal multimodal dense fusion architecture.
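To make the search mechanics concrete, the sketch below shows the
standard continuous relaxation used by gradient-based architecture
search (a softmax-weighted mixture of candidate operations per edge, in
the spirit of DARTS); the class name MixedOp and the candidate
operation set are illustrative assumptions, not the paper's actual
search space.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MixedOp(nn.Module):
        """One edge of a searchable cell: a softmax-weighted mixture of
        candidate operations, so the architecture logits receive
        gradients and can be optimized jointly with the weights."""
        def __init__(self, channels):
            super().__init__()
            # Illustrative candidate set (an assumption, not the paper's).
            self.ops = nn.ModuleList([
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.Conv2d(channels, channels, 5, padding=2),
                nn.MaxPool2d(3, stride=1, padding=1),
                nn.Identity(),
            ])
            # Continuous architecture parameters, one logit per op.
            self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

        def forward(self, x):
            weights = F.softmax(self.alpha, dim=0)
            return sum(w * op(x) for w, op in zip(weights, self.ops))

After the search converges, each edge is typically discretized by
keeping its highest-weighted operation.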
Specifically, a semantic-consistency constraint is introduced to
guarantee dense fusion between hierarchical optical and SAR features
with high semantic consistency, and thus to capture their complementary
information on land properties.
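The abstract states the constraint only at a high level; one plausible
reading is a loss that encourages the optical and SAR feature maps
chosen for dense fusion to agree semantically. The function below is a
minimal sketch under that assumption (the pooling and cosine-similarity
choices are ours, not necessarily the paper's):

    import torch.nn.functional as F

    def semantic_consistency_loss(opt_feats, sar_feats):
        """Hypothetical sketch: penalize semantic disagreement between
        hierarchical optical/SAR feature maps at matching levels.
        opt_feats, sar_feats: lists of (N, C, H, W) tensors; channel
        counts are assumed to match at each level."""
        loss = 0.0
        for f_opt, f_sar in zip(opt_feats, sar_feats):
            # Global-average-pool each map into a semantic descriptor.
            v_opt = f_opt.mean(dim=(2, 3))
            v_sar = f_sar.mean(dim=(2, 3))
            # 1 - cosine similarity is zero when descriptors align.
            loss = loss + (1.0 - F.cosine_similarity(v_opt, v_sar, dim=1)).mean()
        return loss / len(opt_feats)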
Finally, a curriculum learning strategy is adopted to train M2SC-FAS.
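The abstract does not specify the pacing criterion; a minimal
easy-to-hard schedule under an assumed scalar difficulty score might
look like the following:

    def curriculum_batches(samples, difficulty, num_stages=3):
        """Hypothetical curriculum pacing: sort samples by an assumed
        difficulty score and release them in progressively harder
        stages; stage k trains on the easiest (k+1)/num_stages of
        the data."""
        ordered = sorted(samples, key=difficulty)
        return [ordered[: (k + 1) * len(ordered) // num_stages]
                for k in range(num_stages)]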
Extensive experiments demonstrate the superior performance of our
method on three broad co-registered optical and SAR datasets.