Uncertainty Modeling in Multimodal Understanding and Generation via Probability Distribution Encoders
Junjie Wang
Waseda University

Corresponding Author: [email protected]

Abstract

In multimodal semantic understanding, handling inherent uncertainty is essential for mitigating ambiguous interpretations across multiple targets. This paper introduces the Probability Distribution Encoder (PDE), a versatile, plug-and-play module that uses sequence-level and feature-level interactions to model these uncertainties as probability distributions. We demonstrate its adaptability by integrating PDE into established frameworks, yielding models such as SWINPDE. Compared with previous methods, our probabilistic approach substantially enriches multimodal semantic understanding. We incorporate this uncertainty modeling into prevalent pre-training architectures and propose specialized pre-training tasks: Distribution-based Vision-Language Contrastive Learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM).
Experiments show that our models achieve state-of-the-art (SoTA) results on a range of downstream tasks, including image-text retrieval, visual question answering, visual reasoning, visual entailment, and video captioning. Furthermore, our qualitative results reveal several advantageous properties conferred by this uncertainty modeling, such as greater semantic expressiveness than point representations and the ability to generate diverse yet accurate predictions.