Speech-driven Personalized Gesture Synthetics: Harnessing Automatic Fuzzy Feature Inference
  • Fan Zhang,
  • Zhaohan Wang,
  • Xin Lyu,
  • Siyuan Zhao,
  • Mengjian Li,
  • Weidong Geng,
  • Naye Ji,
  • Hui Du,
  • Fuxing Gao,
  • Hao Wu,
  • Shunman Li
Fan Zhang
Faculty of Humanities and Arts

Corresponding Author: [email protected]

Speech-driven gesture generation is an emerging field within virtual human creation. Its primary objective is to produce authentic, personalized co-speech gestures from appropriate input conditions. A significant challenge is that the many factors inherent in these conditions (acoustic, semantic, emotional, personality-related, and even subtle unknown features) are difficult to determine accurately and can be regarded as a complex class of fuzzy sets. Consequently, relying solely on explicit, manually annotated classification labels limits the potential diversity of output states. To address these challenges, we introduce Persona-Gestor, a novel approach that integrates an automatic fuzzy feature inference mechanism with a probabilistic diffusion-based non-autoregressive transformer model. The fuzzy feature inference mechanism, embedded within a condition extractor, automatically extracts feature sets solely from raw speech audio. These features then serve as input conditions for generating personalized 3D full-body gestures. The condition extractor leverages the WavLM large-scale pre-trained model to capture local and global audio information in a unified latent representation associated with gestures. Because this representation carries all pertinent information, manual annotation labels are no longer necessary, which streamlines multimodal processing. Furthermore, we employ adaptive layer normalization to better model the intricate relationship between speech and gestures. Finally, the learning and synthesis stages are carried out through a diffusion process, which yields a wide range of gesture-generation outcomes.
Extensive subjective and objective evaluations conducted on three high-quality co-speech gesture datasets (Trinity, ZEGGS, and BEAT) demonstrate our method’s superior performance compared to recent approaches.
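
As an illustration of the adaptive layer normalization mentioned in the abstract, the sketch below shows one common formulation, in which a condition vector (for example, a speech-derived latent) predicts a per-feature scale and shift that are applied after ordinary layer normalization. The abstract does not give the exact formulation used in Persona-Gestor, so this form, the function names (layer_norm, linear, adaptive_layer_norm), and the weight layout are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def layer_norm(x: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Normalize a feature vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight: Matrix, bias: Sequence[float], c: Sequence[float]) -> List[float]:
    """Plain affine map: one output per row of `weight`."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weight, bias)]


def adaptive_layer_norm(
    x: Sequence[float],
    cond: Sequence[float],
    w_scale: Matrix,
    b_scale: Sequence[float],
    w_shift: Matrix,
    b_shift: Sequence[float],
) -> List[float]:
    """Assumed formulation: y = LN(x) * (1 + scale(cond)) + shift(cond).

    `cond` stands in for a condition embedding; the scale and shift are
    predicted from it rather than being fixed learned parameters.
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    normed = layer_norm(x)
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]   # hypothetical gesture features
    cond = [0.5, -0.5]         # hypothetical condition embedding
    zeros_w = [[0.0, 0.0] for _ in x]
    zeros_b = [0.0 for _ in x]
    # With zero-initialized projections the layer reduces to plain layer norm.
    print(adaptive_layer_norm(x, cond, zeros_w, zeros_b, zeros_w, zeros_b))
```

With zero-initialized scale and shift projections, the output equals plain layer normalization, so in this formulation the conditioning signal only modulates the features once those projections are non-zero.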
27 Feb 2024: Submitted to TechRxiv
27 Feb 2024: Published in TechRxiv