Abstract
Traditionally, speech emotion recognition (SER) research has relied on
acoustic features handcrafted through manual feature engineering.
However, the design of handcrafted features for complex SER tasks
requires significant manual effort, which impedes generalisability and
slows the pace of innovation. This has motivated the adoption of
representation learning techniques that can automatically learn an
intermediate representation of the input signal without any manual
feature engineering. Representation learning has led to improved SER
performance and enabled rapid innovation. Its effectiveness has further
increased with advances in deep learning (DL), which has facilitated
deep representation learning where hierarchical representations are
automatically learned in a data-driven manner. This paper presents the
first comprehensive survey on the important topic of deep representation
learning for SER. We highlight various techniques and related
challenges, and identify important areas for future research. Our survey
bridges the gap in the literature, since existing surveys focus either
on SER with hand-engineered features or on representation learning in
the general setting without addressing SER.