Deep Emotion Recognition in Dynamic Data using Facial, Speech and
Textual Cues: A Survey
Abstract
With the growth of social media and human-computer interaction, video has become one of the most common data formats. Emotion recognition, a research hotspot, is essential for serving people by perceiving their emotional states in videos. In recent years, a large number of studies have tackled emotion recognition based on the three most common modalities in videos: face, speech and text. Given the lack of review papers covering all three modalities together, this paper collects and organizes the relevant studies on emotion recognition using facial, speech and textual cues. Moreover, because deep learning techniques have proved effective at learning latent representations for emotion recognition, this paper concentrates on deep learning-based methods.
models for the purpose of interpreting the definition of emotion. Then
we introduce the state-of-the-art for emotion recognition based on
unimodality including facial expression recognition, speech emotion
recognition and textual emotion recognition. For multimodal emotion
recognition, we summarize the feature-level and decision-level fusion
methods in detail. In addition, the description of relevant benchmark
datasets, the definition of metrics and the performance of the
state-of-the-art in recent years are also outlined for the convenience
of readers to find out the current research progress. Ultimately, we
explore some potential research challenges and opportunities to give
researchers reference for the enrichment of emotion recognition-related
researches.