Deep Emotion Recognition in Dynamic Data using Facial, Speech and Textual Cues: A Survey

With the development of social media and human-computer interaction, video has become one of the most common data formats. Emotion recognition, an active research topic, serves people by perceiving their emotional state in videos. In recent years, a large number of studies have tackled emotion recognition based on the three most common modalities in videos: face, speech and text. Because review papers concentrating on these three modalities are lacking, this paper organizes the relevant studies of emotion recognition using facial, speech and textual cues. Moreover, because deep learning techniques are effective at learning latent representations for emotion recognition, this paper focuses on deep learning-based methods. We first introduce widely accepted emotion models to clarify the definition of emotion. We then introduce the state-of-the-art in unimodal emotion recognition, including facial expression recognition, speech emotion recognition and textual emotion recognition. For multimodal emotion recognition, we summarize feature-level and decision-level fusion methods in detail. In addition, we describe the relevant benchmark datasets, define the evaluation metrics and outline the performance of recent state-of-the-art methods so that readers can readily follow current research progress. Finally, we explore potential research challenges and opportunities as a reference for researchers working on emotion recognition.


I. INTRODUCTION
The past decade has seen the rapid development of emotion recognition in many human-machine interaction and social media applications. Facing external stimuli, the human nervous system generates a corresponding subjective attitude and expresses emotions through multiple channels, including face, voice, speech, gait, gesture, and physiological signals such as electroencephalogram (EEG) and electrocardiogram (ECG). Emotions play a crucial role in life and significantly affect the social behavior and decision-making of human beings. On social media, a large number of videos are generated by users from all over the world and uploaded publicly to the internet, and face, speech and text are the most common modalities in these videos. Consequently, it is necessary to advance research on emotion recognition from these modalities. From the perspective of modalities, research is divided into unimodal emotion recognition and multimodal emotion recognition. Early research focuses on unimodal emotion recognition such as facial expression recognition (FER), speech emotion recognition (SER) and textual emotion recognition (TER), which attempt to learn emotional features from the faces, voices and words of humans, respectively. Some studies also treat another modality as an auxiliary during training to improve the performance of emotion recognition in the primary modality [1] [2]. Recently, multimodal emotion recognition has gradually been explored, as the complementarity among several modalities can significantly improve the accuracy of emotion recognition. Meanwhile, there are also some efforts to alleviate the inter-subject variations caused by human attributes such as age, gender and ethnicity [3]. From another perspective, emotion recognition in conversations (ERC) is distinguished from emotion recognition in non-conversational settings.
Consider a scenario in which an intelligent robot in a customer service system must recognize the emotions of a customer after the customer speaks in a dialogue: two or more parties exist in the dialogue, and a party can be influenced either by its own past state or by the states of the other parties. As a result, a series of novel methods have been proposed for ERC [4][5] [6] [7][8] [9].
Various surveys of emotion recognition have been published in recent years [10] [11] [12][13] [14][15] [16][17] [18]. Li et al. [11] investigate the state-of-the-art for both static and dynamic FER, and detail the FER pipeline in terms of datasets, preprocessing, hand-crafted features, deep learning embeddings and performance comparison. For SER, [13] [17] survey the fundamentals of pipelines similar to those of FER. For TER, Alswaidan et al. [14] survey the state-of-the-art approaches but lack an in-depth overview of deep learning-based methods. Deng et al. [10] provide a thorough survey that systematically reviews deep learning-based methods for TER in terms of word embedding, deep learning architecture, training-level approaches and challenges. In addition, several surveys of multimodal emotion recognition have been published [15][18] [19]. Mello et al. [19] systematically analyze the meta factors that influence the performance of multimodal emotion recognition. Poria et al. [18] review both emotion recognition and sentiment analysis from unimodality to multimodality. Aside from the visual, speech and textual modalities, Jiang et al. [15] consider real-time mental health monitoring systems and take physiological signals (e.g. EEG, ECG) into consideration.
To the best of our knowledge, no comprehensive survey of multimodal emotion recognition has appeared in the last two years. Meanwhile, with the rapid development of social networks, a survey focusing on the facial, speech and textual modalities that are most common in social media is much needed. This paper aims to fill this gap by describing the techniques developed in recent years for both unimodal and multimodal emotion recognition. Our contributions are summarized as follows:
1) We present a systematic review covering the definition of emotions, research progress in emotion recognition based on deep learning techniques, benchmark datasets, metrics, performance and future challenges.
2) We give a comprehensive introduction to the pipelines for unimodal emotion recognition, including preprocessing, hand-crafted feature extraction and deep feature learning. For multimodal emotion recognition, we also introduce the current research status with respect to feature-level and decision-level fusion.
3) We outline datasets in several modalities as completely as possible, and describe their attributes from several perspectives (e.g. modality, year, sample size, number of subjects, label type, context, language and access). In addition, we summarize and compare the performance of recent emotion recognition methods on these datasets.
4) We further investigate existing challenges and opportunities in multimodal emotion recognition, and discuss valuable research challenges and directions for future work.
The rest of this paper is organized as follows. Section II introduces the definition of emotion. Section III describes traditional techniques for preprocessing and hand-crafted feature extraction, and the state-of-the-art deep learning methods for emotion recognition. Section IV introduces dynamic datasets along several dimensions, and summarizes and compares the performance of methods proposed in recent years on these datasets. Existing challenges and opportunities are discussed in Section V. Finally, Section VI concludes the paper.

II. DEFINITION OF EMOTION
Researchers have gravitated to the task of emotion recognition over the past two decades owing to the significance of emotion in human nature. To better define and compare representations of emotion, two kinds of models have been proposed and are commonly used in recent research: discrete and continuous representation models. Both have clear advantages and disadvantages. A discrete model restricts the representation of emotion to a fixed set of categories, which users can interpret intuitively, but it leads to significant bias when raters annotate erroneously. A continuous model uses numerical annotations that are suitable for modelling but are not intuitive for users. The choice between the two kinds of models depends on the actual needs.
Discrete representations: Discrete representations of emotions are widely used because they are intuitive and simple, which means that people can easily relate to them. Based on the hypothesis that emotions can be divided into multiple categories, the well-known categories of emotions are proposed by Ekman et al. [20], where emotions are separated into six categories: Happiness, Sadness, Fear, Anger, Disgust and Surprise. Furthermore, Plutchik [21] proposes an emotional wheel, shown in Fig. 1(a), which takes the correlation of emotions into consideration and arranges eight emotion categories along four bipolar axes: Joy-Sadness, Fear-Anger, Trust-Disgust and Surprise-Anticipation. Each complex emotion can be represented by the four bipolar axes with different intensities. Note that one or more of the discrete emotions mentioned above can occur simultaneously within a period of dynamic data (e.g. an utterance). Consequently, multi-label emotion recognition is essential when the annotations of samples are discrete.
Continuous representations: Aside from discrete representations, emotions can be represented via several numerical dimensions. Five core dimensions are mainly used to represent emotions: Pleasure/Valence, Arousal/Activation, Dominance/Power, Anticipation/Expectation and Intensity. Pleasure refers to how positive or negative the expression of a person appears. Arousal represents how dynamic or lethargic a person acts. Dominance represents the degree to which a person feels in control. Strictly speaking, Dominance contains two related concepts, power and control, where power is mainly about internal resources and control is about the relationship between resources and external factors. Both concepts are usually considered together when Dominance is annotated. Anticipation also subsumes two concepts: expectation and anticipation. Raters take the balance between them into consideration. Intensity is how far a person's behavior is from pure rationality. Typically, a person who behaves unemotionally receives a low score in the Intensity dimension. Ideally, these numerical dimensions would allow an emotion to be represented accurately as a dimensional vector. In practice, a combination of two or three dimensions is adopted. The most popular emotion representation models are the Circumplex model [22] and the PAD model [23]. The Circumplex model, shown in Fig. 1(b), adopts two dimensions to represent emotion: Pleasure and Arousal, which are deemed adequate to represent most emotions. Aside from Pleasure and Arousal, the PAD model also takes Dominance into consideration.

III. STATE-OF-THE-ART
In this section, we introduce the state-of-the-art methods for emotion recognition in dynamic data using the facial, speech and text modalities and their combination. Specifically, we first describe the preprocessing techniques and deep feature learning methods that have been widely used and have achieved good performance in recent years for FER in dynamic data. Then we introduce SER in terms of preprocessing, hand-crafted feature extraction and deep feature learning. Next, we introduce the word embedding methods and deep feature learning methods used when the text modality is used to train emotion recognition models. Finally, we summarize fusion strategies for emotion recognition at the feature level and the decision level, respectively.
A. Face

1) Preprocess
The interference in the visual modality mainly comes from noisy complex backgrounds, variations in illumination and head pose in unconstrained scenarios [11]. To overcome the interference of complex backgrounds, the usual practice is to detect the face region in each video frame via face detectors such as Viola-Jones [24]. For better emotion recognition, the coordinates of localized landmarks are aligned with the face region as the input of the training model [25], [26], [27], [28], [29], [30], [31]. Illumination has an adverse effect on emotion recognition. Thus one of the preprocessing steps is to balance the lighting of the face via a series of techniques, e.g. histogram equalization [32], discrete cosine transform [33] [34], isotropic and anisotropic diffusion [35], difference of Gaussian [36] and homomorphic filtering [37]. There is also a series of pose normalization techniques [38][39] [40] that yield frontal facial views to overcome the head pose problem. It is worth mentioning that pose normalization is essential when emotion recognition is performed on static data (e.g. images) but not on dynamic data (e.g. videos), as emotional information can be conveyed via head pose. In addition, data augmentation (e.g. cropping, flipping, rotation, shifting, skew, scaling, noise, contrast and color jittering) is widely used when static data is preprocessed to mitigate overfitting. When dealing with dynamic data, these data augmentation techniques can optionally be applied to each frame.
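As an illustration of the illumination-balancing step, the following sketch applies plain global histogram equalization to a grayscale face crop. It is a minimal example under stated assumptions rather than the procedure of [32]: the image is assumed to be a list of rows of integer intensities in [0, 255], and the function name `equalize_histogram` and the toy patch are hypothetical.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize a grayscale image given as rows of ints in [0, levels)."""
    pixels = [p for row in image for p in row]
    total = len(pixels)
    # Histogram of intensity values.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution function (CDF) of the intensities.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Constant image: there is no contrast to stretch.
        return [row[:] for row in image]
    # Remap intensities so that the output CDF is approximately linear.
    scale = (levels - 1) / (total - cdf_min)
    lut = [round((c - cdf_min) * scale) if c >= cdf_min else 0 for c in cdf]
    return [[lut[p] for p in row] for row in image]


# Toy low-contrast patch (assumed values, for illustration only).
dark_face = [[50, 50, 52], [52, 54, 54], [56, 56, 58]]
equalized = equalize_histogram(dark_face)
```

After equalization the narrow intensity range of the patch is stretched over the full range while the ordering of pixel intensities is preserved.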
2) Deep Feature Learning
CNN-RNN: Frame-level feature learning means that latent spatial features of emotions are extracted from each selected frame, and then all of the spatial features are aggregated or fed into another module for temporal learning. The authors of [41] propose a structure that integrates VGG-FACE [49] with GRU [50] to extract spatial and temporal features respectively, where VGG-FACE has been pre-trained on a large dataset for face recognition and achieves excellent performance. In addition, the combination of VGG-FACE and LSTM [51] is adopted in [42] [43]. The CNN-RNN structure is more suitable for extracting features of macro-expressions (longer facial expressions, roughly 24-60 frames), as the features fed into the RNN come from the higher layers of the CNN and are therefore abstract and global.
3D-CNN: In recent years, 3D-CNN, as a type of spatio-temporal learning module, has been widely adopted for emotion recognition in videos. Instead of extracting features from each image frame, 3D-CNN directly extracts spatio-temporal features, named C3D, from video via 3D convolutional kernels. 3D-CNN can capture both macro-expressions and micro-expressions (roughly 2-10 frames), because 3D convolution convolves a 3D kernel over the cube formed by stacking multiple contiguous frames, starting from the lowest layer. Specifically, 3D-CNNs pre-trained for human action recognition are adopted to extract spatio-temporal features of emotions in videos [52] [53]. Furthermore, 3D-CNNs pre-trained on sports data are also applied in a series of studies [ [57]. 3D-CNN is increasingly well received when emotional features need to be extracted from videos.
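To make the idea of convolving a 3D kernel over a cube of stacked frames concrete, the sketch below implements a naive single-channel 3D convolution in pure Python. It is only an illustration under simplifying assumptions (one channel, no padding, no stride, no learned weights) and is not the C3D architecture; the function name `conv3d_valid` and the toy clip are hypothetical.

```python
def conv3d_valid(clip, kernel):
    """Naive single-channel 3D convolution (cross-correlation, no padding).

    clip:   nested list indexed [t][y][x] (stacked grayscale frames)
    kernel: nested list indexed [dt][dy][dx]
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Sum over the temporal and both spatial extents of the kernel.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Four 3x3 frames whose brightness increases by 1 per frame (toy data).
clip = [[[float(t)] * 3 for _ in range(3)] for t in range(4)]
# A 2x1x1 temporal-difference kernel: responds to change between adjacent frames.
kernel = [[[-1.0]], [[1.0]]]
motion = conv3d_valid(clip, kernel)
```

Because the kernel spans two frames, even this first layer responds to frame-to-frame change, which is the property the text relies on when it says that 3D convolution can capture short micro-expressions.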
Two-Stream Network: Compared with CNN-RNN and 3D-CNN, the Two-Stream Network is a novel and less well-studied architecture for emotion recognition in videos. It is mainly composed of two parallel convolutional networks, a spatial network and a temporal network, which process a static image and instantaneous motion information (e.g. optical flow), respectively. Deng et al. propose a Two-Stream Network named MIMAMO [58], which consists of two stages, a Two-Stream Convolutional Neural Network and a Gated Recurrent Unit network, to capture micro-motion and macro-motion, respectively. The feature representation of a snippet, i.e. an RGB image and a sequence of images centered in time around that RGB image, is extracted via the Two-Stream Convolutional Neural Network. Specifically, in the temporal network of MIMAMO, the Complex Steerable Pyramid [59] is applied to obtain the phase difference between two consecutive facial frames, which replaces optical flow. These hand-crafted features are fed into a CNN to extract latent features. Similarly, a ResNet50 [60] pre-trained on the VGGFace2 face recognition dataset [61] is used to extract features from the center image in the spatial network of MIMAMO. MIMAMO is shown in Fig. 2. Moreover, Pan et al. [62] use two CNNs to extract features from an RGB image and from optical flow features in the Two-Stream Network. Feng et al. [63] feed an RGB frame and a hand-crafted feature, i.e. LBP-TOP along the x-t and y-t axes, into two CNNs.

B. Speech

1) Preprocess
The first step after collecting data is to improve its quality with preprocessing techniques so that SER becomes more accurate. We describe preprocessing in terms of preemphasis, framing, windowing and voice endpoint detection.
Preemphasis: the average power spectrum of the speech signal is influenced by the glottis, nose and mouth, which attenuate the power in the high-frequency band. The purpose of preemphasis is to compensate the high-frequency energy so that the spectrum is flat over the whole frequency band. Typically, the z-transfer function of the high-pass filter for preemphasis is defined as
$$H(z) = 1 - \mu z^{-1},$$
where µ ∈ [0.9, 1.0] is the preemphasis coefficient.
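A first-order high-pass filter with transfer function H(z) = 1 − µz⁻¹ corresponds in the time domain to y[n] = x[n] − µ·x[n−1]. The sketch below applies this difference equation to a list of samples; the default µ = 0.97 is an assumed value inside the stated range [0.9, 1.0], and the function name `preemphasize` is hypothetical.

```python
def preemphasize(signal, mu=0.97):
    """Apply y[n] = x[n] - mu * x[n-1], the time-domain form of H(z) = 1 - mu * z^-1."""
    if not signal:
        return []
    out = [signal[0]]  # n = 0 has no previous sample, so it is passed through
    for n in range(1, len(signal)):
        out.append(signal[n] - mu * signal[n - 1])
    return out


# A constant (low-frequency) signal is strongly attenuated after the first sample,
# while a rapidly alternating (high-frequency) signal is amplified.
constant = preemphasize([1.0, 1.0, 1.0, 1.0])
alternating = preemphasize([1.0, -1.0, 1.0, -1.0])
```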
Framing: framing partitions the speech signal into fixed-length segments, which is justified by the short-time stationarity of speech: the signal remains approximately invariant over a sufficiently short period. In addition, two adjacent frames usually overlap by 30% to 50% to preserve inter-frame information. Consequently, task-related features can be extracted from these quasi-stationary frames for SER.
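The framing step can be sketched as follows. This is a minimal example assuming the signal is a list of samples and that trailing samples which do not fill a whole frame are simply dropped (padding is another common choice); the function name `frame_signal` is hypothetical.

```python
def frame_signal(signal, frame_len, overlap=0.5):
    """Split a signal into fixed-length frames with fractional overlap in [0, 1)."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    # Hop size: the number of samples between the starts of adjacent frames.
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop
    return frames  # trailing samples that do not fill a frame are dropped


samples = list(range(10))
frames = frame_signal(samples, frame_len=4, overlap=0.5)
```

With a frame length of 4 and 50% overlap, each frame shares its last two samples with the next frame.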
Windowing: framing is carried out by multiplying the speech signal by a windowing function. Hence, a suitable windowing function needs to be chosen. The most widely used windowing function is the Hamming window, which is defined as
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{M-1}\right), \quad 0 \le n \le M-1,$$
where M represents the window size and w(n) is the weight applied to the n-th sample of a frame. The Hamming window alleviates the spectral leakage that occurs during the Fourier Transform because of discontinuities at the edges of the signal. Other windowing functions, e.g. the rectangular window and the Hanning window, are also available.
Voice endpoint detection: voice endpoint detection distinguishes voiced data from unvoiced data and noise. Voiced data is generated by the vibration of the vocal folds, which creates periodic excitation of the vocal tract during the pronunciation of phonemes. When air passes through a constriction in the vocal tract, aperiodic excitations arise, which induce transient and turbulent noise, i.e. unvoiced data. Typically, the zero crossing rate [64], i.e. the rate at which the sign of the signal changes within a time frame, is a frequently used measure for detecting voice endpoints.
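The two steps above can be illustrated with a short sketch. It uses the standard Hamming window formula w(n) = 0.54 − 0.46·cos(2πn/(M−1)) and a simple zero crossing rate defined as the fraction of adjacent sample pairs whose signs differ. The function names are hypothetical, and no decision threshold is given because a suitable value would depend on the data.

```python
import math


def hamming_window(M):
    """Hamming window w(n) = 0.54 - 0.46 * cos(2*pi*n / (M - 1)), n = 0..M-1."""
    if M == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (M - 1)) for n in range(M)]


def apply_window(frame, window):
    """Multiply a frame by a window of the same length, sample by sample."""
    return [x * w for x, w in zip(frame, window)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0)
    )
    return crossings / (len(frame) - 1)


window = hamming_window(5)
tapered = apply_window([1.0, 1.0, 1.0, 1.0, 1.0], window)
# A noise-like frame changes sign often; a smooth voiced-like frame does not.
noisy_like = zero_crossing_rate([1.0, -1.0, 1.0, -1.0, 1.0])
voiced_like = zero_crossing_rate([0.2, 0.6, 0.9, 0.6, 0.2])
```

The window tapers the frame towards its edges, which is what reduces the edge discontinuities mentioned above.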
2) Hand-crafted Feature
Prosodic features: prosodic features convey significantly distinctive properties of emotions for SER. When people experience different emotions, their intonation and rhythm differ. To represent emotions, fundamental frequency (F0), energy and duration are widely used. In particular, the change of F0 is an essential feature for representing rhythmic and tonal characteristics. For instance, the F0 contour usually rises when joy is expressed. Other statistics of F0, such as its mean, can also be used to represent emotions. Energy, also known as intensity, reflects the amplitude of speech. High-arousal emotion is generally accompanied by increased energy, while low-arousal emotion shows the opposite [65]. Emotional state is also represented via duration-related features, such as the durations of voiced, unvoiced and silent segments.
Voice quality features: Harmonics-to-Noise Ratio (HNR), jitter and shimmer are the most representative features of voice quality. HNR is the ratio of harmonic components to noise components in speech. Note that the noise here is not ambient noise, but glottal noise caused by incomplete glottal closure. Jitter is a physical property that describes the change of the fundamental frequency of the voice between adjacent vibratory cycles. It mainly reflects the degree of roughness and hoarseness of the voice. Shimmer measures the change of the amplitude of the voice between adjacent vibratory cycles, mainly reflecting the degree of hoarseness.
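Some of these descriptors reduce to a few lines of arithmetic once per-cycle measurements are available. The sketch below computes short-time energy, and jitter and shimmer in one common "local" formulation (mean absolute difference between consecutive cycles, divided by the mean value). This formulation, the use of cycle periods rather than frequencies for jitter, and the per-cycle measurements themselves are assumptions made for illustration; in practice the cycle lengths and peak amplitudes would come from a pitch tracker, which is not shown.

```python
def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def local_perturbation(values):
    """Mean absolute difference between consecutive values, relative to their mean."""
    if len(values) < 2:
        return 0.0
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


# Hypothetical per-cycle measurements (assumed values, not real speech data).
periods_ms = [5.0, 5.2, 4.8, 5.0]            # lengths of glottal cycles
peak_amplitudes = [0.50, 0.55, 0.45, 0.50]   # peak amplitude of each cycle

jitter = local_perturbation(periods_ms)         # cycle-to-cycle period variation
shimmer = local_perturbation(peak_amplitudes)   # cycle-to-cycle amplitude variation
energy = short_time_energy([0.5, -0.5, 0.5, -0.5])
```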
Spectral features: The shape of the vocal tract is represented in a number of sound-related tasks (e.g. automatic speech recognition). The most commonly used representation is the spectrum, which transforms the speech signal from the time domain into the frequency domain. Mel Frequency Cepstral Coefficients (MFCC), which represent the short-term power spectrum of the signal, are the most frequently used features for SER. A series of other spectral features can also be used for SER. Unlike MFCC, Gammatone Frequency Cepstral Coefficients (GFCC) apply a Gammatone filter-bank to the power spectrum. In addition, Linear Prediction Cepstral Coefficients (LPCC), Log-Frequency Power Coefficients (LFPC), etc. are also worth considering.
3) Deep Feature Learning
To learn emotional representations in speech accurately, deep learning techniques are widely used in speech emotion recognition systems. Typically, spatio-temporal learning modules built on hand-crafted low-level descriptors are the most commonly used structures for extracting deeper, high-level features in the latent space to express emotions, as shown in Fig. 3, and they have proved significantly effective for SER.
Various studies model the speech signal at the segment level or chunk level to learn emotion representations, and then combine the outputs of the subsequences into a sentence-level representation via various strategies (e.g. RNN, Attention) to predict emotions. Segment-level SER fixes the step size of segments, so the number of segments in an utterance varies with its arbitrary length. Zhang et al. [66] propose a structure that adopts an LSTM model to capture an utterance-level representation of emotion on the basis of segment-level CNN features. Mustaqeem et al. [67] propose CNN+Attention to extract spatio-temporal features for SER. Other studies also achieve satisfactory performance for SER [68][69] [70]. In contrast, chunk-level SER produces a fixed number of segments regardless of the duration of the speech signal by changing the step size of the chunks according to the duration. Lin et al. [71] propose a flexible framework that can tackle several speech-based sequence-to-one tasks (e.g. SER, speaker recognition). They vary the step size of the chunks to extract a fixed number of fixed-size chunks from speech signals of varied duration without preprocessing such as cropping or zero padding, and the temporal aggregation models built with this framework are shown to be effective in terms of robustness, accuracy and computational efficiency.
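The difference between the two splitting schemes can be sketched with start indices alone. The example below is an illustration of the general idea and not the exact procedure of Lin et al. [71]: the rule for the step, the rounding, and the function names `chunk_starts` and `segment_starts` are assumptions.

```python
def chunk_starts(num_frames, chunk_len, num_chunks):
    """Start indices of `num_chunks` chunks of `chunk_len` frames spanning the utterance.

    The step between chunks adapts to the utterance length, so short utterances
    yield heavily overlapping chunks and long utterances yield spread-out chunks.
    """
    if num_frames < chunk_len:
        raise ValueError("utterance shorter than one chunk")
    if num_chunks == 1:
        return [0]
    step = (num_frames - chunk_len) / (num_chunks - 1)  # may be fractional
    return [int(round(i * step)) for i in range(num_chunks)]


def segment_starts(num_frames, seg_len, step):
    """Segment-level splitting: the step is fixed, so the segment count varies."""
    return list(range(0, num_frames - seg_len + 1, step))


short_chunks = chunk_starts(num_frames=120, chunk_len=100, num_chunks=5)
long_chunks = chunk_starts(num_frames=500, chunk_len=100, num_chunks=5)
short_segments = segment_starts(num_frames=120, seg_len=100, step=50)
long_segments = segment_starts(num_frames=500, seg_len=100, step=50)
```

Both utterances yield five chunks, whereas the fixed-step scheme yields one segment for the short utterance and nine for the long one, so only the chunk-level scheme gives a downstream model a fixed-size input without cropping or padding.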
C. Text
1) Word Embedding
Typical word embedding: Traditional one-hot encoding of words suffers from high dimensionality and a lack of contextual information. Word embedding models that consider syntactic context aim to embed words into a low-dimensional space to overcome the sparsity of traditional models, and have been widely used for natural language processing (NLP) tasks. Typical embedding models include Word2Vec [72][73], GloVe [74], ELMo [75], BERT [76], etc., which are trained on large amounts of unlabeled text. Word2Vec and GloVe are trained on the hypothesis that co-occurring words are semantically similar, and each word has a unique representation. Nevertheless, they cannot deal with the problems brought by polysemy and antonymy. In other words, a word may carry different meanings in different contexts, and antonyms tend to lie close together in the embedding space [77]. In recent years, pre-trained language models such as ELMo and BERT have been widely adopted in NLP tasks and significantly boost performance. These language models dynamically generate word embedding vectors according to the current context, which makes up for the disadvantages of Word2Vec and GloVe. Other models such as GPT, MASS [78] and XLNet [79] have also been proposed and have attracted attention.
Emotional word embedding: Aside from typical word embedding models, emotional word embedding models that embed words with emotional information have been proposed and applied to emotion-related tasks such as emotion recognition and sentiment analysis. There are two kinds of emotional embedding models: word-level and sentence-level embedding models. Emo2Vec [80] is a word-level emotional representation trained with six different emotion-related tasks. DeepMoji [81] is a sentence-level emotional representation built by a biLSTM model with attention on a corpus of 1246 million tweets, which contains abundant emoji information (emojis being an expression of emotions). Winata et al. [82] analyze and compare both typical and emotional word embedding models in detail, and show that DeepMoji outperforms other word embedding models by a large margin, which they attribute to the corpus used to train DeepMoji being well matched to emotion-related tasks.
2) Deep Feature Learning
One of the most common scenarios for TER is dialogue systems. In conversation, semantic information is an important expression of emotions, and an appropriate semantic analysis benefits the prediction of emotions. Emotion recognition in conversation has become a hot topic in recent years, and conversation brings a tricky issue, namely the contextual dependency in dyadic or multi-party dialogue. To tackle this issue, recent studies propose a series of novel structures to learn contextual representations of emotions, and obtain satisfactory results [4][83] [84][85] [86]. Deng et al. [84] propose to integrate a general sentence representation and an emotional feature representation at the sentence level for further context-level learning. A GRU network is used to learn contextual information from previous sentences. In addition, emotion correlations are considered, and an emotion correlation learner based on BiGRU is proposed to predict multi-label emotions. Jiao et al. [85] use BiGRU to learn feature and contextual representations of historical utterances as memories, and propose an Attention GRU, whose state is updated by attention, to predict emotions. In contrast to purely supervised learning from a large amount of annotated data, Hazarika et al. [83] approach emotion recognition in conversation from another perspective: they transfer the parameters of a dialogue model pre-trained on multi-turn conversations to a conversational emotion classifier, and achieve significant improvement in performance and robustness. Shen et al. [4] propose an all-in-one XLNet model with enhanced memory to store longer historical context and dialog-aware self-attention to deal with multi-party structures. How to effectively capture the contextual dependency arising from inter-party and intra-party interactions still needs further exploration.

D. Multimodality
Emotions can be expressed via more than one modality of data in social life. One can perceive the real-time emotion of another person by fusing evidence from facial, speech and textual information, which is regarded as an effective and accepted way to improve the accuracy of emotion recognition. Recent studies emphasize the use of multimodal data, propose a series of novel structures and demonstrate the advantages of multimodal fusion. There are mainly two types of multimodal fusion methods: feature-level and decision-level fusion.
1) Feature-level Fusion
Concatenation: The most widely used strategy for multimodal fusion is to directly concatenate the emotion representations generated from each modality. A large number of studies propose novel fusion strategies for multimodal data based on concatenation [ [92]. Typically, Liang et al. [89] take both reconstruction loss and classification loss into consideration, where each modality is reconstructed via a pair of encoder and decoder, and the latent representations of all modalities are concatenated as the input of the emotion classifier. Hossain et al. [54] concatenate emotion representations extracted from the speech and facial modalities as the input of extreme learning machines for emotion recognition. In [8], the concatenated multimodal features serve as the input of a GRU to learn the global state and the personal state with respect to the global and personal context in a conversation, respectively.
With the development of Attention [93], one popular strategy for fusing emotion representations at the feature level is to combine the primitive concatenation strategy with attention, which achieves promising performance for emotion recognition [94][95] [96]. Mittal et al. [95] integrate a fusion layer into the memory fusion network [97] to fuse the input modalities and classify emotions. Tzirakis et al. [94] test a variety of attention-mechanism-based methods, including simple concatenation, self-attention, hierarchical attention, residual self-attention and cross-modal hierarchical self-attention, and demonstrate the effectiveness of cross-modal hierarchical self-attention for the fusion of emotion representations. Lian et al. [96] use single-modal and cross-modal Transformers to extract context-independent utterance-level features, and then propose an Audio-Text-Speaker Fusion component for multimodal fusion (ATS-Fusion) and a Multi-Head Attention based bi-directional GRU for contextual feature extraction, where the effectiveness of ATS-Fusion is verified by comparison with simple concatenation.
Graph network: Graph neural networks (GNN) have been widely applied in a variety of tasks, e.g. information diffusion analysis for online social networks and human action recognition. Recently, GNN has gradually displayed its capacity for the fusion and feature learning of multiple modalities. To the best of our knowledge, the latest study of multimodal fusion based on GNN is [87], which demonstrates the effectiveness of GNN for multimodal multi-label emotion recognition. In [87], Zhang et al. propose a novel GNN-based emotion learning structure named Heterogeneous Hierarchical Message Passing Network (HHMPN). HHMPN is mainly composed of four components, as shown in Fig. 4(a). One of the components is the Feature-to-Feature Level, where each extracted feature of each modality is a node in the graph and messages are passed from one feature node to another. The other components, namely the Feature-to-Label Level, Label-to-Label Level and Modality-to-Label Level, pass messages among different types of nodes. Benssassi et al. [98] synthesize a neural synchrony graph representation from facial and speech features that are extracted via spiking neural networks (SNN). The neural synchrony graph serves as the input of a Graph Convolutional Network (GCN) for emotion recognition.
Some atypical fusion methods are also proposed [42][99]. Nie et al. [42] fuse speech and facial features with a factorized bilinear pooling operation [100]. The process is as follows: speech and facial features are fed into a pipeline consisting of fully connected layers, element-wise multiplication, dropout, sum-pooling and L2 normalization to generate a high-level descriptor. Wu et al. [99] propose a two-stage fuzzy fusion strategy that combines Canonical Correlation Analysis (CCA) and the Fuzzy Broad Learning System (FBLS) [101] to deal with the imbalanced contributions of each modality and with the correlation and difference between multimodal features. More novel fusion methods are expected in the near future.
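The pipeline listed above for factorized bilinear pooling can be sketched in a few lines. The following is an illustrative inference-time version under stated assumptions, not the implementation of [42] or [100]: the weight matrices `U` and `V` are toy values rather than learned parameters, bias terms are omitted, dropout is skipped because it is only active during training, and sum-pooling uses non-overlapping windows of size `k`.

```python
import math


def linear(weights, x):
    """Fully connected layer without bias: `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def factorized_bilinear_pool(speech, face, U, V, k):
    """Fuse two feature vectors with factorized bilinear pooling.

    U and V project each modality to k * o dimensions; the projections are
    multiplied element-wise, sum-pooled over non-overlapping windows of size k,
    and L2-normalized. Dropout is a training-time step and is omitted here.
    """
    joint = [a * b for a, b in zip(linear(U, speech), linear(V, face))]
    pooled = [sum(joint[i:i + k]) for i in range(0, len(joint), k)]
    norm = math.sqrt(sum(p * p for p in pooled))
    return [p / norm for p in pooled] if norm > 0 else pooled


# Toy weights (hypothetical; learned in practice). k = 2, output size o = 2.
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
fused = factorized_bilinear_pool([1.0, 2.0], [1.0, 0.0, 2.0], U, V, k=2)
```

The element-wise product of the two projections is what lets every speech dimension interact with every facial dimension at a cost far below that of a full bilinear (outer-product) layer.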
2) Decision-level Fusion
Decision-level fusion assigns separate models to the facial, speech and textual features for emotion recognition. The results generated by the unimodal emotion recognition models are combined or aggregated for the final prediction. This fusion method does not consider the correlation among modalities, but it offers the flexibility to choose the most suitable model for each modality.
There are relatively fewer studies of decision-level fusion than of feature-level fusion. Xu et al. [88] propose a 3DCLS (3D Convolutional-Long Short Term Memory) hybrid model to recognize visual emotions and a CNN-RNN hybrid model to recognize text-based emotions, as shown in Fig. 4(b). In addition, an SVM is utilized to recognize emotion from speech. The classification probabilities of each emotion category produced by each classifier are weighted and summed to obtain the final probability distribution over emotions. Dahmane et al. [43] apply a sequential temporal CNN and LSTM with an Attention Weighted Average layer, and Fisher-Vector-encoding-based local and global descriptors, to generate features from the visual and acoustic modalities respectively, and then perform late fusion at the decision level. In [102], a multi-task CNN and an SVM are utilized to extract features from the speech and facial modalities, and the fusion is performed by a meta-classifier.
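The weighted-sum form of decision-level fusion can be illustrated with a short sketch. The function name `fuse_decisions`, the modality names, the class probabilities and the weights below are assumptions made for the example and are not taken from [88]; the weights are renormalized so that the fused scores remain a probability distribution.

```python
def fuse_decisions(probabilities, weights):
    """Weighted sum of per-modality class probability vectors.

    probabilities: dict modality -> list of class probabilities.
    weights: dict modality -> non-negative weight of that classifier.
    """
    total_weight = sum(weights[m] for m in probabilities)
    n_classes = len(next(iter(probabilities.values())))
    fused = [0.0] * n_classes
    for modality, probs in probabilities.items():
        share = weights[modality] / total_weight  # normalized weight
        for c, p in enumerate(probs):
            fused[c] += share * p
    return fused


if __name__ == "__main__":
    # Hypothetical outputs of three unimodal classifiers over 3 classes.
    scores = {
        "visual": [0.7, 0.2, 0.1],
        "text": [0.2, 0.6, 0.2],
        "speech": [0.5, 0.3, 0.2],
    }
    fused = fuse_decisions(scores, {"visual": 0.5, "text": 0.3, "speech": 0.2})
    predicted_class = max(range(len(fused)), key=fused.__getitem__)
    print(fused, predicted_class)
```

Because each classifier is only consulted through its output probabilities, any unimodal model can be swapped without retraining the others, which is the flexibility noted above.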

A. Datasets
Carefully constructed datasets are essential to comprehensively evaluate the performance of emotion recognition systems across modalities. In this section, we introduce a series of available datasets that are widely used in recent unimodal and multimodal research. A summary of the datasets is shown in TABLE 1, where datasets are grouped by modality and described in terms of year, utterance-level sample size, number of subjects, label type, context, language and access. We do not cover image-only datasets, as the primary purpose of the paper is to inform readers about dynamic data such as videos.

B. Evaluation Metrics
Evaluation of models is essential to advance research. We introduce widely used metrics for evaluating the performance of models for both categorical and continuous emotion recognition.

1) Categorical
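For categorical emotion recognition, most state-of-the-art methods use Accuracy (also called Recall) [96] and the F1 score to evaluate model performance. Suppose there are $C$ emotion classes in a dataset, and let $N_c$ denote the number of samples of class $c$, where $c \in \{1, 2, ..., C\}$. For class $c$, the precision, recall and F1 score are
$$P_c = \frac{TP_c}{TP_c + FP_c}, \qquad R_c = \frac{TP_c}{TP_c + FN_c}, \qquad F1_c = \frac{2 P_c R_c}{P_c + R_c},$$
where $TP_c$, $FP_c$, $TN_c$ and $FN_c$ are the numbers of true positives, false positives, true negatives and false negatives of class $c$, respectively. The other metrics are defined as (6)(7)(8)(9).
• Weighted average accuracy (ACC):
$$ACC = \frac{\sum_{c=1}^{C} N_c R_c}{\sum_{c=1}^{C} N_c} \tag{6}$$
• Unweighted average accuracy (uACC):
$$uACC = \frac{1}{C} \sum_{c=1}^{C} R_c \tag{7}$$
• Weighted average F1 (F1):
$$F1 = \frac{\sum_{c=1}^{C} N_c F1_c}{\sum_{c=1}^{C} N_c} \tag{8}$$
• Unweighted average F1 (uF1):
$$uF1 = \frac{1}{C} \sum_{c=1}^{C} F1_c \tag{9}$$
where $N_c$ represents the number of samples of class $c$ in the dataset.
Fig. 6. Performances of models for emotion recognition on tri-modal datasets in the recent three years.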
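The four categorical metrics can be computed directly from label lists, as in the sketch below. It assumes the conventional definitions, namely that per-class accuracy is the per-class recall and that weighted averages use the class sizes $N_c$ as weights; the function name `categorical_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def categorical_metrics(y_true, y_pred):
    """Return ACC, uACC, weighted F1 and uF1 for two label sequences."""
    classes = sorted(set(y_true))
    n_c = Counter(y_true)          # N_c: number of samples per class
    total = len(y_true)
    recall, f1 = {}, {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall[c] = tp / n_c[c]    # TP_c / (TP_c + FN_c)
        denom = precision + recall[c]
        f1[c] = 2 * precision * recall[c] / denom if denom else 0.0
    return {
        "ACC": sum(n_c[c] * recall[c] for c in classes) / total,
        "uACC": sum(recall.values()) / len(classes),
        "F1": sum(n_c[c] * f1[c] for c in classes) / total,
        "uF1": sum(f1.values()) / len(classes),
    }


if __name__ == "__main__":
    truth = ["hap", "hap", "hap", "hap", "sad", "sad"]
    preds = ["hap", "hap", "hap", "sad", "sad", "hap"]
    print(categorical_metrics(truth, preds))
```

On imbalanced data the weighted and unweighted variants diverge: the weighted scores are dominated by the majority class, whereas the unweighted scores treat every class equally.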

2) Continuous
Both the Pearson Correlation Coefficient (PCC) and the Concordance Correlation Coefficient (CCC) are widely used to estimate the performance of emotion recognition with continuous annotations. Let $y_i$ denote the true value of sample $i$ and $\hat{y}_i$ the predicted value of sample $i$. The definitions of PCC and CCC are given in (10)(11).
• Pearson Correlation Coefficient:
$$PCC = \frac{\sigma^2_{y\hat{y}}}{\sigma_y \sigma_{\hat{y}}} \tag{10}$$
where $\sigma^2_{y\hat{y}}$, $\sigma_y$ and $\sigma_{\hat{y}}$ represent the covariance between $y$ and $\hat{y}$, the standard deviation of $y$ and the standard deviation of $\hat{y}$, respectively.
• Concordance Correlation Coefficient:
$$CCC = \frac{2\sigma^2_{y\hat{y}}}{\sigma^2_y + \sigma^2_{\hat{y}} + (\mu_y - \mu_{\hat{y}})^2} \tag{11}$$
where $\sigma^2_{y\hat{y}}$ represents the covariance between $y$ and $\hat{y}$, $\sigma^2_y$ represents the variance of $y$, $\sigma^2_{\hat{y}}$ represents the variance of $\hat{y}$, $\mu_y$ represents the mean value of $y$ and $\mu_{\hat{y}}$ represents the mean value of $\hat{y}$.
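The difference between the two continuous metrics is easy to see numerically. The sketch below computes both from their standard definitions using population (divide-by-n) statistics; the function name `pcc_and_ccc` and the toy values are assumptions made for illustration.

```python
import math


def pcc_and_ccc(y_true, y_pred):
    """Return (PCC, CCC) for two equally long sequences of numbers."""
    n = len(y_true)
    mu_t = sum(y_true) / n
    mu_p = sum(y_pred) / n
    var_t = sum((t - mu_t) ** 2 for t in y_true) / n
    var_p = sum((p - mu_p) ** 2 for p in y_pred) / n
    cov = sum((t - mu_t) * (p - mu_p) for t, p in zip(y_true, y_pred)) / n
    # PCC: covariance divided by the product of the standard deviations.
    pcc = cov / math.sqrt(var_t * var_p)
    # CCC: additionally penalizes differences in mean and in scale.
    ccc = 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)
    return pcc, ccc


if __name__ == "__main__":
    # Predictions are perfectly correlated with the truth but shifted by 1.
    print(pcc_and_ccc([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]))
```

In this example the predictions follow the annotations perfectly up to a constant offset, so PCC stays at 1 while CCC drops to 5/7, which shows that CCC rewards agreement in value and not only in trend.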

C. Performances and discussion
A large number of studies emphasize improving performance on the task of emotion recognition. TABLE 2 summarizes the state-of-the-art of emotion recognition in terms of modality, year of publication, benchmark datasets, metrics and performance. Fig. 5 illustrates the distribution of the number of papers over datasets. The statistics show that IEMOCAP, MELD and OMG are the three most frequently used datasets. Consequently, Fig. 6 displays the performances of models proposed in the recent three years on these three tri-modal datasets, where the metric is plotted on the horizontal axis and the modality on the vertical axis so that readers can clearly see the research trend for emotion recognition on these datasets. Note that for IEMOCAP, 6 discrete classes (anger, happiness, sadness, neutral, excitement and frustration) are considered in Fig. 6(a) for convenient comparison of state-of-the-art frameworks. Meanwhile, some studies consider 4 classes (anger, happiness, sadness, neutral), merging the happiness and excitement categories into a single happiness category to alleviate the variance caused by ambiguous annotations, as shown in Fig. 6(b). TABLE 2 shows that the most commonly used metrics for discrete emotion labels are ACC, uACC, F1 and uF1, and that the most commonly used metric for continuous emotion labels is CCC. However, for discrete emotion recognition, some studies report only a subset of ACC, uACC, F1 and uF1, which makes it difficult for readers to directly compare studies that use different metrics. We expect future research to report these metrics comprehensively. We can also find that an increasing number of emotion recognition methods based on multimodal fusion have been proposed and achieve good performance.
Meanwhile, research on unimodal emotion recognition remains active; SER and TER in particular have been research hotspots in recent years. Consequently, more innovative research is expected.

A. Privacy enhancement
With the development of social media and artificial intelligence, many applications aim to recognize the emotions of users. This procedure often requires users to transmit data to a server, and the transmission is vulnerable to hacking and re-identification. Eavesdroppers can easily obtain sensitive information from the intercepted data. An improved, privacy-protecting strategy is to transmit a data representation generated on the device to the server, where the representation is then processed by a preset mechanism for further tasks. Nevertheless, sensitive demographic information can still be leaked from such representations by certain techniques. Demographic information such as race, age and gender is significant in hiring, policing and credit ratings. Therefore, one primary task for privacy protection is to eliminate the demographic information present in the representations of multimodal data while maintaining performance on the task of emotion recognition. To the best of our knowledge, the first study on privacy-enhanced emotion recognition in multimodal data is [90]. We expect more studies focusing on privacy issues in the future.

B. Generalization and personalization
Recent studies focus on exploring more generalized models based on deep learning techniques, which are trained on a large number of in-the-wild datasets with emotion labels to overcome the differences in how each person expresses emotion and the variation in how emotion is understood and displayed depending on the situation, the interaction partner and even the time of day [136][137][138]. Nevertheless, these models perform excellently on training datasets but poorly on real-world problems, as a person's expressions may be extremely personalized and not well represented in the training datasets. For example, one person's expression of happiness can sometimes look like an expression of sadness. Thus, personalization of models is essential for adapting to new individuals. To overcome the poor performance caused by such individual differences, transfer learning and lifelong learning techniques are utilized. Cross-corpus emotion recognition [139][140] aims to improve the performance of emotion recognition on a target domain by transferring knowledge from a source domain with labelled samples to a target domain with few or no labelled samples. These models present a significant improvement on target datasets but have limitations when applied to real-world scenarios due to the expensive and slow adaptation process. Lifelong learning is considered a major breakthrough in balancing generalization and personalization. Barros et al. [130] handle the interplay between generalization and personalization through self-organizing mechanisms that interpret emotions from a self-organized general emotion representation and personalized emotion recognition. Incoming samples are organized incrementally, which means that no retraining is required. Lifelong, unsupervised representation learning for both generalized and personalized multimodal emotion recognition needs to be further explored in the future.

C. Unified model
A large number of studies concentrate on proposing novel models for emotion recognition in unimodal, bimodal and multimodal settings. Despite this, there are still few unified models that are simultaneously suitable for each modality and for arbitrary combinations of modalities. Studies focusing on multimodal fusion experiment on multimodal data to improve the accuracy of emotion recognition, but lack empirical evidence of the effectiveness of the models when some modalities are unavailable. Mittal et al. [95] propose M3ER, which utilizes a Modality Check Step to replace an unavailable modality with a proxy feature and fuses multimodal features with a multiplicative fusion module. M3ER is a promising technique but likewise lacks experiments in unimodal and bimodal settings. [94][128] evaluate their models in unimodal and multimodal settings but lack experiments in bimodal settings. [6] proposes a model for real-time emotion detection in conversations and experiments on bimodal and multimodal data but not on unimodal data. [7] and [87] evaluate their models using different modality combinations but lack comparisons against baselines in unimodal and bimodal settings. Liang et al. [89] evaluate their model and present empirical analysis in unimodal, bimodal and multimodal settings. In short, one of the challenges is to propose unified models with sufficient comparative experiments in unimodal, bimodal and multimodal settings.

VI. CONCLUSION
This paper comprehensively reviews and summarizes the definition of emotion and the state-of-the-art of unimodal emotion recognition, including facial expression recognition, speech emotion recognition and textual emotion recognition in dynamic data. In addition, this paper summarizes the corresponding benchmark datasets, metrics and performances to clarify the development trend of research on emotion recognition. Finally, we present potential research challenges and future directions to enrich the research in this field.