Let's Chorus: Partner-aware Hybrid Song-Driven 3D Head Animation

Let's Chorus: Given a chorus song segment consisting of background music and vocals as input, PaChorus generates realistic animations with consistent emotion and dynamic head movement.

Abstract

Singing is a vital form of human emotional expression and social interaction, distinguished from speech by its richer emotional nuances and freer expressive style. Investigating 3D facial animation driven by singing therefore holds significant research value. Our work focuses on 3D singing facial animation driven by mixed singing audio; to the best of our knowledge, no prior studies have explored this area. Additionally, the absence of existing 3D singing datasets poses a considerable challenge.

To address this, we collect a novel audiovisual dataset, ChorusHead, which features synchronized mixed vocal audio and pseudo-3D FLAME motions for chorus singing. In addition, we propose a partner-aware 3D chorus head generation framework driven by mixed audio inputs. The framework extracts emotional features from the background music, models the dependence between singers, and represents head movement in a latent space learned by a Variational Autoencoder (VAE), enabling diverse, interactive head animation generation. Extensive experimental results demonstrate that our approach effectively generates 3D facial animations of interacting singers, achieving notable improvements in realism and strong robustness to background-music interference.
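To make the abstract's description more concrete, below is a minimal, hypothetical sketch of a partner-aware VAE motion prior: head motion is encoded into a latent space conditioned on both the singer's own (mixed) audio features and the partner's vocal features, so that sampling the latent yields diverse motions that still respect the partner. All module names, feature dimensions (e.g. 128-d audio features, a 103-d FLAME-style motion vector), and the simple MLP architecture are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class PartnerAwareMotionVAE(nn.Module):
    """Illustrative VAE motion prior conditioned on own and partner audio."""

    def __init__(self, audio_dim=128, motion_dim=103, latent_dim=64, hidden=256):
        super().__init__()
        # Encoder fuses the singer's motion with audio features from the
        # mixed track (music + own vocal) and the partner's vocal.
        self.encoder = nn.Sequential(
            nn.Linear(motion_dim + 2 * audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        # Decoder reconstructs head motion from the latent sample plus the
        # same audio conditioning, so sampling z gives diverse motions.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + 2 * audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, motion, own_audio, partner_audio):
        cond = torch.cat([own_audio, partner_audio], dim=-1)
        h = self.encoder(torch.cat([motion, cond], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar

    @torch.no_grad()
    def sample(self, own_audio, partner_audio):
        # At inference, draw a latent from the prior instead of the encoder.
        cond = torch.cat([own_audio, partner_audio], dim=-1)
        z = torch.randn(cond.shape[0], self.to_mu.out_features, device=cond.device)
        return self.decoder(torch.cat([z, cond], dim=-1))


# Toy usage with per-frame features (batch of 4 frames).
model = PartnerAwareMotionVAE()
motion = torch.randn(4, 103)        # e.g. FLAME pose + expression parameters (assumed)
own_audio = torch.randn(4, 128)     # features from the mixed track (assumed)
partner_audio = torch.randn(4, 128)
recon, mu, logvar = model(motion, own_audio, partner_audio)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = nn.functional.mse_loss(recon, motion) + 1e-3 * kl
```

The per-frame MLP is only a placeholder; a temporal model over audio and motion sequences would be the natural choice in practice, but the conditioning pattern (own mixed audio plus partner vocal) stays the same.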

Method

Video

We compare our method with the state-of-the-art methods FaceFormer, CodeTalker, SelfTalk, Imitator, and Diffspeaker on our ChorusHead dataset. Note that only PaChorus generates choral animations with dynamic head motions, while the others do not.

Additionally, we present more visualized results, notably inference results involving multiple singers. Because current multi-speaker audio separation models are not yet reliable for extracting individual voices from group singing, we synthesized choral audio from known vocal tracks. By conditioning the motion prior module on the other participants' voices during inference, we achieve plausible results.
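As a rough illustration of how choral audio can be synthesized from known vocal tracks, the sketch below simply sums individual stems and peak-normalizes the mix. The file paths are placeholders and the stems are assumed to share a sample rate and channel layout; this is a minimal assumption-laden example, not the paper's data pipeline.

```python
import numpy as np
import soundfile as sf

# Placeholder paths: separately recorded vocal stems plus the accompaniment.
paths = ["singer_a.wav", "singer_b.wav", "accompaniment.wav"]
stems, rates = zip(*(sf.read(p) for p in paths))
assert len(set(rates)) == 1, "stems must share a sample rate"

# Trim to the shortest stem, mix by summation, then peak-normalize to avoid clipping.
# Assumes all stems have the same channel count (e.g. all mono).
length = min(len(s) for s in stems)
mix = np.sum([s[:length] for s in stems], axis=0)
mix = mix / (np.max(np.abs(mix)) + 1e-8)

sf.write("chorus_mix.wav", mix, rates[0])
```

Because the individual vocal tracks are known, each singer's own voice and the partner conditioning signal are available directly, sidestepping the unreliable separation step.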