Abstract
Surveillance monitoring encompasses many tasks, such as object detection,
person identification, and activity and action recognition. Integrating a
variety of surveillance tasks through a multimodal interactive system
benefits real-life deployment and supports human operators. We first
introduce the Surveillance Video Question Answering (SVideoQA) dataset,
the first of its kind. The SVideoQA dataset addresses the multi-camera
surveillance monitoring aspect through the multimodal context of Video
Question Answering (VideoQA). This paper proposes a deep learning model
that tackles the VideoQA task on the SVideoQA dataset by capturing
memory-driven relationships between the appearance and motion aspects of
the video features. At each level of relational reasoning, the attentive
parts of the motion and appearance contexts are identified and forwarded
through frame-level and clip-level relational reasoning modules. The
corresponding memories are updated and forwarded to a memory-relation
module, which finally predicts the answer word. The proposed
memory-driven multilevel relational reasoning is made compatible with the
surveillance monitoring task through the incorporation of a multi-camera
relation module, which captures and reasons over the relationships among
video feeds from multiple cameras. Experimental results show that the
proposed memory-driven multilevel relational reasoning performs
significantly better on the open-ended VideoQA task than other
state-of-the-art systems. The proposed method achieves accuracies of
57\% and 57.6\% on the single-camera and multi-camera tasks of the
SVideoQA dataset, respectively.