TechRxiv
Visual Question Answering Through Adversarial Learning of Multi-modal Representation

preprint
posted on 31.07.2020 by Iqbal Chowdhury, Kien Nguyen Thanh, Clinton Fookes, Sridha Sridharan
Solving the Visual Question Answering (VQA) task is a step towards achieving human-like reasoning capability in machines. This paper proposes an approach to learning a multimodal feature representation with adversarial training. Adversarial training allows the model to learn from standard fusion methods in an unsupervised manner. The discriminator is equipped with a siamese combination of two standard fusion methods, namely multimodal compact bilinear pooling and multimodal Tucker fusion. The multimodal feature representation output by the generator is the result of a graph convolutional operation. The multimodal representation resulting from adversarial training allows the proposed model to infer correct answers to open-ended natural language questions from the VQA 2.0 dataset. An overall accuracy of 69.86% demonstrates the effectiveness of the proposed model.

Email Address of Submitting Author

n9597573@qut.edu.au

Submitting Author's Institution

Queensland University of Technology

Submitting Author's Country

Australia
