A Decoupling Paradigm with Prompt Learning for Remote Sensing Image
Change Captioning
Abstract
Remote sensing image change captioning (RSICC) is a novel task that aims
to describe the differences between bi-temporal images in natural
language. Previous methods overlook an important property of the task:
its difficulty differs between unchanged and changed image pairs. They
process unchanged and changed image pairs in a coupled way, which often
confuses change captioning. In this
paper, we decouple the task into two questions to make it easier:
whether changes have occurred, and what they are. An image-level
classifier performs binary classification to answer the first question.
A feature-level encoder extracts discriminative features that help the
caption generation module answer the second. For caption generation, we
use prompt learning to introduce pre-trained large language models
(LLMs) into the RSICC task. We propose a multi-prompt learning strategy
that generates a set of unified prompts and a class-specific prompt
conditioned on the image-level classifier’s result. These prompts tell
the pre-trained LLM whether changes exist and guide it to generate
captions. Finally, the multiple prompts and the features from the
feature-level encoder are fed into a frozen LLM for captioning. Compared
with previous
methods, our method leverages the strong language abilities of the
pre-trained LLM to generate plausible captions, and the LLM itself is
kept frozen and needs no training. Extensive experiments show that our
method is effective and achieves state-of-the-art performance. Moreover,
an additional experiment demonstrates that our decoupling paradigm is
more promising than the previous coupled paradigm for the RSICC task. We
will
make our codebase publicly available to facilitate future research at
https://github.com/Chen-Yang-Liu/PromptCC
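
The decoupled flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the embedding width, the number of unified prompts, the thresholded toy classifier, the difference-based toy encoder, and the order in which prompts and features are concatenated are all assumptions made for illustration. Random vectors stand in for learned prompt parameters, and the frozen LLM itself is omitted. Only the assembly of its input sequence is shown.

```python
import random

EMBED_DIM = 8      # hypothetical embedding width
NUM_UNIFIED = 4    # hypothetical number of unified prompt tokens


def make_tokens(n, dim, seed):
    """Return n pseudo-random vectors; a stand-in for learned parameters."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]


# Unified prompts are shared by all image pairs (assumed layout).
UNIFIED_PROMPTS = make_tokens(NUM_UNIFIED, EMBED_DIM, seed=0)
# One class-specific prompt per classifier outcome (assumed layout).
CLASS_PROMPTS = {
    False: make_tokens(1, EMBED_DIM, seed=1)[0],  # "no change" prompt
    True: make_tokens(1, EMBED_DIM, seed=2)[0],   # "change" prompt
}


def classify_changed(feat_before, feat_after, threshold=0.5):
    """Toy image-level classifier: mean absolute difference vs. a threshold."""
    diffs = [abs(a - b)
             for row_a, row_b in zip(feat_before, feat_after)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs) > threshold


def encode_difference(feat_before, feat_after):
    """Toy feature-level encoder: token-wise difference features."""
    return [[b - a for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(feat_before, feat_after)]


def build_llm_input(feat_before, feat_after):
    """Assemble the sequence that would be fed to the frozen LLM."""
    changed = classify_changed(feat_before, feat_after)   # whether changes exist
    visual_tokens = encode_difference(feat_before, feat_after)  # what changed
    # Assumed order: unified prompts, class-specific prompt, visual features.
    sequence = UNIFIED_PROMPTS + [CLASS_PROMPTS[changed]] + visual_tokens
    return changed, sequence
```

In this sketch the classifier's decision only selects which class-specific prompt is placed in the sequence, so the LLM needs no gradient updates. Any trainable parameters would sit in the prompts, the classifier, and the encoder.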