
Visual Question Answering Through Adversarial Learning of Multi-modal Representation
  • Iqbal Chowdhury ,
  • Kien Nguyen Thanh ,
  • Clinton Fookes ,
  • Sridha Sridharan
Iqbal Chowdhury
Queensland University of Technology

Corresponding Author:[email protected]



Solving the Visual Question Answering (VQA) task is a step towards giving machines human-like reasoning capability. This paper proposes an approach that learns a multimodal feature representation through adversarial training. Adversarial training allows the model to learn from standard fusion methods in an unsupervised manner. The discriminator is equipped with a Siamese combination of two standard fusion methods, namely multimodal compact bilinear pooling and multimodal Tucker fusion. The multimodal feature representation output by the generator is the result of a graph convolutional operation. The multimodal representation learned through adversarial training allows the proposed model to infer correct answers to open-ended natural language questions from the VQA 2.0 dataset. An overall accuracy of 69.86% demonstrates the effectiveness of the proposed model.
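
For readers unfamiliar with the fusion methods named in the abstract, the following is a minimal sketch of generic compact bilinear pooling as it is commonly described: each feature vector is projected with a Count Sketch, and the two sketches are combined by circular convolution computed with the FFT. It is not the authors' discriminator or generator. The function names, feature sizes, and random inputs are all assumptions chosen for illustration.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project vector x into d dimensions using hash indices h and signs s."""
    out = np.zeros(d)
    # np.add.at accumulates correctly even when several entries share a bucket.
    np.add.at(out, h, s * x)
    return out

def compact_bilinear_pool(x, y, hx, sx, hy, sy, d):
    """Fuse x and y by circular convolution of their Count Sketches (via FFT)."""
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fy = np.fft.rfft(count_sketch(y, hy, sy, d))
    return np.fft.irfft(fx * fy, n=d)

# Hypothetical sizes and random features, for illustration only.
rng = np.random.default_rng(0)
n_img, n_txt, d = 8, 6, 16
hx = rng.integers(0, d, size=n_img)        # hash buckets for image features
sx = rng.choice([-1.0, 1.0], size=n_img)   # random signs for image features
hy = rng.integers(0, d, size=n_txt)        # hash buckets for question features
sy = rng.choice([-1.0, 1.0], size=n_txt)   # random signs for question features

image_feat = rng.standard_normal(n_img)
question_feat = rng.standard_normal(n_txt)

fused = compact_bilinear_pool(image_feat, question_feat, hx, sx, hy, sy, d)
print(fused.shape)
```

The fused vector equals the Count Sketch of the outer product of the two inputs, so it approximates bilinear interaction without ever forming the full outer product. Real systems typically use far larger dimensions than the toy sizes assumed here.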