TechRxiv
manuscript_taffc.pdf (4.01 MB)

Deep Emotion Recognition using Facial, Speech and Textual Cues: A Survey

preprint
posted on 21.10.2021, 02:28, authored by Tao Zhang, Zhenhua Tan
With the development of social media and human-computer interaction, perceiving people's emotional states in videos has become essential. In recent years, a large number of studies have tackled emotion recognition based on the three most common modalities in videos: face, speech and text. Because review papers concentrating on these three modalities are lacking, this paper surveys studies of emotion recognition using facial, speech and textual cues based on deep learning techniques. We first introduce widely accepted emotion models in order to interpret the definition of emotion. We then review the state of the art in unimodal emotion recognition, including facial expression recognition, speech emotion recognition and textual emotion recognition. For multimodal emotion recognition, we summarize feature-level and decision-level fusion methods in detail. In addition, we outline the relevant benchmark datasets, the definitions of evaluation metrics and the performance of recent state-of-the-art methods, so that readers can conveniently track current research progress. Finally, we explore potential research challenges and opportunities as a reference for enriching emotion recognition research.
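The distinction the abstract draws between feature-level (early) and decision-level (late) fusion can be illustrated with a minimal sketch. This is not the paper's method; the embedding sizes, class labels and probability values below are purely hypothetical, chosen only to show where the two fusion strategies combine information.

```python
import numpy as np

# Hypothetical per-modality representations for one video clip
# (sizes are illustrative assumptions, not from the survey).
rng = np.random.default_rng(0)
face_feat = rng.random(128)    # facial embedding
speech_feat = rng.random(64)   # speech embedding
text_feat = rng.random(32)     # textual embedding

# Feature-level (early) fusion: concatenate the modality features
# into a single vector, which a shared classifier would then consume.
fused_features = np.concatenate([face_feat, speech_feat, text_feat])
print(fused_features.shape)  # (224,)

# Decision-level (late) fusion: each modality has its own classifier;
# their class-probability outputs are combined, e.g. by averaging.
# Example class order (assumed): [happy, sad, neutral].
face_probs = np.array([0.7, 0.2, 0.1])
speech_probs = np.array([0.5, 0.3, 0.2])
text_probs = np.array([0.6, 0.1, 0.3])
final_probs = (face_probs + speech_probs + text_probs) / 3
predicted_class = int(np.argmax(final_probs))
print(final_probs, predicted_class)  # [0.6 0.2 0.2] 0
```

Early fusion lets the classifier model cross-modal interactions but must cope with heterogeneous feature scales; late fusion keeps per-modality models independent and combines only their decisions.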

Funding

National Key Research and Development Program of China under Grant No. 2019YFB1405803

National Natural Science Foundation of China under Grant No. 61772125

History

Email Address of Submitting Author

zhangt1111@qq.com

Submitting Author's Institution

Software College, Northeastern University

Submitting Author's Country

China
