Visual Question Answering Through Adversarial Learning of Multi-modal
Representation
Abstract
Solving the Visual Question Answering (VQA) task is a step towards
achieving human-like reasoning capability in machines. This paper
proposes an approach to learning a multimodal feature representation
through adversarial training. The adversarial training allows the
model to learn from standard fusion methods in an unsupervised manner.
The discriminator is equipped with a siamese combination of two
standard fusion methods, namely multimodal compact bilinear pooling and
multimodal Tucker fusion. The multimodal feature representation output
by the generator is produced by a graph convolutional operation. The
resulting multimodal representation enables the proposed model to infer
correct answers to open-ended natural language questions from the
VQA 2.0 dataset. An overall accuracy of 69.86\% demonstrates the
effectiveness of the proposed model.
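To make the described setup concrete, the following is a minimal sketch of one adversarial training step, assuming PyTorch. The graph-convolution generator, the siamese pair of fusion branches in the discriminator, the feature sizes, and the simple gated projections standing in for multimodal compact bilinear pooling and Tucker fusion are all illustrative simplifications, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_V, D_Q, D_M = 2048, 1024, 512   # visual, question, fused feature sizes (illustrative)

class GCNGenerator(nn.Module):
    """Generator: one graph-convolution step over region features, fused with the question."""
    def __init__(self):
        super().__init__()
        self.gc = nn.Linear(D_V, D_M)
        self.q_proj = nn.Linear(D_Q, D_M)
    def forward(self, v, adj, q):
        # v: (B, N, D_V) regions, adj: (B, N, N) normalized adjacency, q: (B, D_Q)
        h = F.relu(torch.bmm(adj, self.gc(v)))   # A . V . W
        return h.mean(dim=1) * self.q_proj(q)    # pooled graph feature, gated by the question

class FusionBranch(nn.Module):
    """Stand-in for one standard fusion method (the paper uses MCB / Tucker fusion)."""
    def __init__(self):
        super().__init__()
        self.v_proj = nn.Linear(D_V, D_M)
        self.q_proj = nn.Linear(D_Q, D_M)
    def forward(self, v, q):
        return self.v_proj(v.mean(dim=1)) * self.q_proj(q)

class SiameseDiscriminator(nn.Module):
    """Siamese pair of fusion branches provides the reference ('real') feature;
    a small classifier head scores any candidate multimodal feature."""
    def __init__(self):
        super().__init__()
        self.branch_a, self.branch_b = FusionBranch(), FusionBranch()
        self.clf = nn.Sequential(nn.Linear(D_M, D_M), nn.ReLU(), nn.Linear(D_M, 1))
    def reference(self, v, q):
        return 0.5 * (self.branch_a(v, q) + self.branch_b(v, q))
    def forward(self, feat):
        return self.clf(feat)                    # real/fake logit

# One illustrative adversarial step on random tensors (batch of 8, 36 regions).
B, N = 8, 36
v, q = torch.randn(B, N, D_V), torch.randn(B, D_Q)
adj = torch.softmax(torch.randn(B, N, N), dim=-1)

G, D = GCNGenerator(), SiameseDiscriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

# Discriminator step: fusion-branch features are treated as real, generator features as fake.
fake = G(v, adj, q)
real = D.reference(v, q)
loss_d = bce(D(real), torch.ones(B, 1)) + bce(D(fake.detach()), torch.zeros(B, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: make the graph-based multimodal feature indistinguishable from fused features.
loss_g = bce(D(G(v, adj, q)), torch.ones(B, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```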