
Cross-Modal Self-Supervised Vision Language Pre-training with Multiple Objectives for Medical Visual Question Answering

posted on 2023-08-01, 18:23 authored by Gang Liu, Pengfei Li, Zixu Zhao, Jinlong He, Genrong He, Shenjun Zhong

Medical Visual Question Answering (VQA) is a task that aims to answer questions about medical images, using both visual and textual information in the reasoning process. The absence of large-scale annotated medical VQA datasets presents a formidable obstacle to training a medical VQA model from scratch in an end-to-end manner. Existing works use image captioning datasets in the pre-training stage and then fine-tune on downstream VQA tasks. Following the same paradigm, we use a collection of public medical image captioning datasets to pre-train multimodal models in a self-supervised setup, and fine-tune them on downstream medical VQA tasks. In this work, we propose a method featuring Cross-Modal pre-training with Multiple Objectives (CMMO), which includes masked image modelling, masked language modelling, image-text matching, and image-text contrastive learning. The proposed method is designed to associate the visual features of medical images with the corresponding medical concepts in captions, in order to learn aligned vision and language feature representations and multi-modal interactions. The experimental results show that our proposed CMMO method outperforms state-of-the-art methods on three public medical VQA datasets, with absolute improvements of 2.6%, 0.9%, and 4.0% on the VQA-RAD, PathVQA, and SLAKE datasets, respectively. We also conduct comprehensive ablation studies to validate our method, and visualize the attention maps, which show strong interpretability. The code and pre-trained weights will be released at
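To make the image-text contrastive objective named in the abstract more concrete, here is a minimal sketch in plain Python of a symmetric InfoNCE-style loss over a batch of paired image and caption embeddings, together with a weighted sum of the four objectives. This is an illustration under assumptions, not the authors' implementation: the function names `contrastive_loss` and `total_loss`, the temperature of 0.07, and the equal loss weights are hypothetical choices that the abstract does not specify.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss (InfoNCE style).

    Row k of image_embs is assumed to be paired with row k of text_embs.
    The temperature value is an assumption, not taken from the paper.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity between every image and every caption, scaled.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text: each image should select its own caption.
    i2t = -sum(_log_softmax(sim[k])[k] for k in range(n)) / n
    # Text-to-image: each caption should select its own image.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(_log_softmax(sim_t[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def total_loss(mim, mlm, itm, itc, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four pre-training losses.

    Equal weights are a placeholder assumption; the abstract does not
    state how the objectives are combined.
    """
    return sum(w * l for w, l in zip(weights, (mim, mlm, itm, itc)))
```

In this sketch, a batch whose captions are correctly paired with their images yields a lower contrastive loss than the same batch with the captions shuffled, which is the alignment pressure the objective is meant to provide.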


Submitting Author's Institution

College of Computer Science and Technology, Harbin Engineering University

Submitting Author's Country

  • China
