Abstract
We present a model architecture that applies reinforcement learning with
transformers to multimodal tasks. Transformers let us take advantage of
the architecture's simplicity and scalability, as well as developments in
language and vision modelling such as ViT, GPT-x, and BERT. Specifically,
we have trained the model to recognize digits from the MNIST dataset from
their associated word labels and the instructions provided. The approach
resembles how an infant learns to associate pictorial representations of
digits with their corresponding words. We use a MoME transformer in
conjunction with Deep Q-Learning to train the model.
Image inputs are embedded with a pre-trained ResNet18 and instructions
with GloVe before being passed to the model for prediction and training.
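The pipeline described above can be sketched in miniature. This is an illustrative assumption-laden sketch, not the paper's implementation: the `embed_image` and `embed_instruction` functions are hypothetical stand-ins for ResNet18 and GloVe embedding, the Q-network is reduced to a single linear layer rather than a MoME transformer, and episodes are treated as single-step, so the Deep Q-Learning target collapses to the immediate reward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: ResNet18 features are 512-d; a small GloVe
# model gives 50-d word vectors; actions are the ten digit classes.
IMG_DIM, TXT_DIM, N_ACTIONS = 512, 50, 10

def embed_image(image):
    # Placeholder for pre-trained ResNet18 feature extraction.
    return rng.standard_normal(IMG_DIM)

def embed_instruction(text):
    # Placeholder for (e.g. mean-pooled) GloVe word embeddings.
    return rng.standard_normal(TXT_DIM)

# Linear Q-network over the concatenated multimodal state
# (a stand-in for the MoME transformer in the actual architecture).
W = rng.standard_normal((IMG_DIM + TXT_DIM, N_ACTIONS)) * 0.01

def dqn_step(image, text, true_digit, lr=0.1):
    """One Q-learning update. With single-step episodes the TD target
    is just the reward, so no bootstrapped next-state term appears."""
    state = np.concatenate([embed_image(image), embed_instruction(text)])
    q = state @ W                          # Q-values for all ten digits
    action = int(q.argmax())               # greedy action = predicted digit
    reward = 1.0 if action == true_digit else -1.0
    td_error = reward - q[action]          # target minus current estimate
    W[:, action] += lr * td_error * state  # gradient step on squared TD error
    return action, reward

action, reward = dqn_step(image=None, text="select the digit three", true_digit=3)
```

In the full system, the reward would come from comparing the model's digit prediction against the word label, and the transformer would jointly attend over both modalities instead of a simple concatenation.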