Disentangling Audio Content and Emotion with Adaptive Instance Normalization for Expressive Facial Animation Synthesis

Authors: Che-Jui Chang, Long Zhao, Mubbasir Kapadia

Abstract:

3D facial animation synthesis from audio has been a research focus in recent years. However, most existing work in the literature is designed to map audio to visual content, providing limited insight into the relationship between emotion in audio and expressive facial animation. In this paper, we aim to generate audio-matching facial animations with a specified emotion label. For such a task, we argue that separating the content from the audio is indispensable: the proposed model must learn to generate facial content from the audio content, and expressions from the specified emotion. We achieve this with an adaptive instance normalization (AdaIN) module that isolates the content in the audio and combines it with the emotion embedding derived from the specified label. The joint content-emotion embedding is then used to generate 3D facial vertices and texture maps. We compare our method with state-of-the-art baselines, including facial segmentation-based and voice conversion-based disentanglement approaches. We also conducted a user study to evaluate the performance of emotion conditioning; the results indicate that our proposed method outperforms the baselines in both animation quality and accuracy of expression categorization.
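The abstract does not include code, but the conditioning mechanism it describes follows the standard AdaIN pattern: normalize the content features to strip their channel-wise statistics, then re-style them with a scale and shift predicted from the emotion embedding. The following is a minimal PyTorch sketch of that pattern; all names, dimensions, and the number of emotion classes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AdaINCondition(nn.Module):
    """AdaIN-style fusion of audio content features with an emotion embedding
    (a sketch of the general technique, not the paper's exact architecture)."""
    def __init__(self, content_dim: int, emotion_dim: int):
        super().__init__()
        # Instance norm without learned affine parameters: removes per-sample,
        # per-channel mean/variance from the audio content features.
        self.norm = nn.InstanceNorm1d(content_dim, affine=False)
        # Predict a channel-wise gain (gamma) and bias (beta) from the emotion.
        self.to_gamma_beta = nn.Linear(emotion_dim, 2 * content_dim)

    def forward(self, content: torch.Tensor, emotion: torch.Tensor) -> torch.Tensor:
        # content: (batch, channels, frames) audio content features
        # emotion: (batch, emotion_dim) embedding of the emotion label
        gamma, beta = self.to_gamma_beta(emotion).chunk(2, dim=1)
        normalized = self.norm(content)
        # Broadcast the emotion-driven scale/shift over the time axis.
        return gamma.unsqueeze(-1) * normalized + beta.unsqueeze(-1)

# Hypothetical usage: fuse audio content with a categorical emotion label.
emotion_table = nn.Embedding(num_embeddings=8, embedding_dim=64)  # 8 classes assumed
adain = AdaINCondition(content_dim=256, emotion_dim=64)
audio_content = torch.randn(4, 256, 100)             # (batch, channels, frames)
labels = torch.randint(0, 8, (4,))                   # emotion labels
fused = adain(audio_content, emotion_table(labels))  # joint content-emotion features
```

Because the instance normalization removes the content features' own first- and second-order statistics before the emotion-derived statistics are applied, the emotion signal controls the feature "style" while the audio content controls its structure, which is the disentanglement the abstract argues for.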

Keywords: adaptive instance normalization, audio-driven animation, content-emotion disentanglement, emotion-conditioning, expressive facial animation synthesis
