A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning
Remote sensing image change captioning (RSICC) is a novel task that aims to describe the differences between bi-temporal images in natural language. Previous methods overlook an important property of the task: its difficulty differs between unchanged and changed image pairs. They process both kinds of pairs in a coupled way, which often confuses the captioning model. In this paper, we decouple the task into two easier questions: whether changes have occurred, and what those changes are. An image-level classifier answers the first question by binary classification. A feature-level encoder extracts discriminative features that help the caption generation module answer the second. For caption generation, we use prompt learning to bring pre-trained large language models (LLMs) into the RSICC task. We propose a multi-prompt learning strategy that generates a set of unified prompts and a class-specific prompt conditioned on the image-level classifier's result. These prompts tell the pre-trained LLM whether changes exist and guide it in generating captions. Finally, the multiple prompts and the features from the feature-level encoder are fed into a frozen LLM for captioning. Compared with previous methods, our method leverages the language abilities of the pre-trained LLM to generate plausible captions without training the LLM itself. Extensive experiments show that our method is effective and achieves state-of-the-art performance. In addition, a further experiment demonstrates that our decoupling paradigm is more promising than the previous coupled paradigm for the RSICC task.
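The decoupled flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.5 decision threshold, the plain-list embeddings, and the concatenation order are all assumptions made for illustration, and the frozen LLM itself is left out.

```python
from typing import Dict, List, Sequence

Vector = List[float]  # stand-in for an embedding vector (assumption)


def classify_changed(change_score: float, threshold: float = 0.5) -> bool:
    """Image-level decision: did anything change between the two images?

    `change_score` stands in for the classifier's output; the threshold
    value is a hypothetical choice, not taken from the paper.
    """
    return change_score >= threshold


def build_llm_input(
    unified_prompts: Sequence[Vector],
    class_prompts: Dict[str, Vector],
    changed: bool,
    visual_tokens: Sequence[Vector],
) -> List[Vector]:
    """Assemble the sequence that would be passed to a frozen LLM.

    The class-specific prompt is selected from the classifier's result,
    then joined with the unified prompts and the encoder features.
    The ordering used here is an assumption.
    """
    class_prompt = class_prompts["changed" if changed else "unchanged"]
    return list(unified_prompts) + [class_prompt] + list(visual_tokens)


# Toy values, purely illustrative.
unified = [[0.1, 0.2], [0.3, 0.4]]
per_class = {"changed": [1.0, 1.0], "unchanged": [0.0, 0.0]}
features = [[0.5, 0.5], [0.6, 0.6], [0.7, 0.7]]

sequence = build_llm_input(unified, per_class, classify_changed(0.9), features)
```

In this sketch only the prompt vectors and the encoder would be trainable; the LLM that consumes `sequence` stays frozen, as the abstract describes.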
Email Address of Submitting Author: liuchenyang@buaa.edu.cn
ORCID of Submitting Author: 0000-0003-3034-6646
Submitting Author's Institution: the Image Processing Center, School of Astronautics, Beihang University
Submitting Author's Country