DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation

Peng Chen1,2, Xiaobao Wei1,2, Ming Lu3, Hui Chen1,2, Feng Tian1,2
1Institute of Software, Chinese Academy of Sciences, 2University of Chinese Academy of Sciences, 3Intel Labs China

Demo Videos: Comparison with SOTA Methods

Abstract

Pipeline Image

Real-time speech-driven 3D facial animation has attracted considerable attention in both academia and industry. Traditional methods mainly focus on learning a deterministic mapping from speech to animation. Recent approaches consider the non-deterministic nature of speech-driven 3D facial animation and employ diffusion models for the task. While existing diffusion-based methods improve the diversity of facial animation, they still lack personalized speaking styles that convey accurate lip movements, and their efficiency and compactness leave room for improvement. In this work, we propose DiffusionTalker to address these limitations via personalizer-guided distillation. For personalization, we introduce a contrastive personalizer that learns identity and emotion embeddings to capture speaking styles from audio. We further propose a personalizer enhancer during distillation to strengthen the influence of these embeddings on facial animation. For efficiency, we use iterative distillation to reduce the number of denoising steps required for animation generation, achieving more than an 8x speedup in inference. For compactness, we distill the large teacher model into a smaller student model, reducing storage by 86.4% while minimizing performance loss. After distillation, users can derive identity and emotion embeddings from audio to quickly create personalized animations that reflect specific speaking styles. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code will be released.

Method

DiffusionTalker employs a contrastive personalizer to extract audio features and personalized embeddings from the input speech. These representations serve as conditioning inputs that guide the motion decoder in denoising noisy facial animations. During the personalizer-guided distillation process, the number of denoising steps in the student model is iteratively halved relative to the teacher, significantly accelerating inference; at the same time, the model parameters are compressed to yield a more compact and efficient model. The personalizer enhancer integrates the personalized embedding with the upper facial regions of the predicted results, leveraging contrastive learning to strengthen the embedding's representational capacity.
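The following is a minimal, illustrative sketch (not the released implementation) of how a contrastive personalizer could condition a diffusion motion decoder and how iterative step-halving distillation might be organized. All module architectures, dimensions (e.g., FLAME's 5023 vertices x 3 as the motion dimension), and the training objective are simplifying assumptions; in particular, the distillation loss here only matches the student to the teacher's prediction on a coarser timestep grid.

```python
# A hedged sketch of personalizer-conditioned denoising and step-halving distillation.
# Names, sizes, and the loss are illustrative assumptions, not the authors' code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastivePersonalizer(nn.Module):
    """Maps pooled audio features to identity and emotion embeddings (assumed sizes)."""
    def __init__(self, audio_dim=768, embed_dim=128):
        super().__init__()
        self.identity_head = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.emotion_head = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, audio_feat):                  # audio_feat: (B, T, audio_dim)
        pooled = audio_feat.mean(dim=1)             # temporal pooling over the utterance
        return self.identity_head(pooled), self.emotion_head(pooled)

class MotionDecoder(nn.Module):
    """Predicts clean facial motion from noisy motion, timestep, and conditions."""
    def __init__(self, motion_dim=15069, audio_dim=768, embed_dim=128, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + audio_dim + 2 * embed_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, noisy_motion, t, audio_feat, id_emb, emo_emb):
        B, T, _ = noisy_motion.shape
        t_feat = t.float().view(B, 1, 1).expand(B, T, 1)
        cond = torch.cat([id_emb, emo_emb], dim=-1).unsqueeze(1).expand(B, T, -1)
        x = torch.cat([noisy_motion, audio_feat, cond, t_feat], dim=-1)
        return self.net(x)                          # predicted clean motion

def distill_step_halving(teacher, personalizer, motion_dim=15069, audio_dim=768,
                         start_steps=16, rounds=3, iters_per_round=100):
    """Each round trains a student that uses half the previous model's denoising steps."""
    steps = start_steps
    for _ in range(rounds):
        student = copy.deepcopy(teacher)            # a smaller decoder could be used here for compactness
        opt = torch.optim.Adam(student.parameters(), lr=1e-4)
        for _ in range(iters_per_round):
            audio = torch.randn(4, 30, audio_dim)   # placeholder audio features
            noisy = torch.randn(4, 30, motion_dim)  # placeholder noisy motion
            id_emb, emo_emb = personalizer(audio)
            t = torch.randint(0, steps // 2, (4,))
            with torch.no_grad():
                # Simplification: the student's step t is matched to the teacher's step 2t;
                # a faithful progressive-distillation target would chain two teacher updates.
                target = teacher(noisy, t * 2, audio, id_emb, emo_emb)
            pred = student(noisy, t, audio, id_emb, emo_emb)
            loss = F.mse_loss(pred, target)
            opt.zero_grad(); loss.backward(); opt.step()
        teacher, steps = student, steps // 2        # the student becomes the next round's teacher
    return teacher, steps
```

The key design point reflected in the paper is that the same identity and emotion embeddings condition every denoising step, so personalization is preserved even as the step count is halved each distillation round.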

DiffusionTalker-Driven 3DGS Talking Head

Recently, 3D Gaussian Splatting (3DGS) has gained increasing attention in head avatar synthesis due to its ability to significantly improve rendering accuracy and inference speed. To integrate 3DGS into DiffusionTalker, we bind 3D Gaussians to the triangular faces of the FLAME mesh, enabling DiffusionTalker-driven 3DGS talking heads. Although the video only presents a rough demo, it shows that DiffusionTalker can also convey accurate emotion in 3DGS-based talking head generation. We will continue to explore this direction.
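Below is a minimal sketch, under our own assumptions, of how 3D Gaussians might be bound to the triangular faces of the FLAME mesh so that they follow the animated vertices. The attribute choices (barycenter placement, per-face tangent frame, area-based scale) and the hypothetical `flame_model` / `flame_faces` names in the usage comment are illustrative, not the exact binding used in the demo.

```python
# A hedged sketch of binding one 3D Gaussian to each FLAME triangle.
import torch

def bind_gaussians_to_faces(vertices, faces):
    """vertices: (V, 3) animated FLAME vertices; faces: (F, 3) triangle vertex indices.

    Returns per-face Gaussian centers, a local orthonormal frame, and an isotropic
    scale derived from the triangle size (all illustrative choices).
    """
    tri = vertices[faces]                        # (F, 3, 3) triangle corner positions
    centers = tri.mean(dim=1)                    # place each Gaussian at the face barycenter
    e1 = tri[:, 1] - tri[:, 0]
    e2 = tri[:, 2] - tri[:, 0]
    normal = torch.cross(e1, e2, dim=-1)
    area = 0.5 * normal.norm(dim=-1, keepdim=True)
    scales = area.clamp(min=1e-8).sqrt().expand(-1, 3)   # isotropic scale ~ triangle size
    # Build a per-face frame (tangent, bitangent, normal) to orient each Gaussian.
    t = torch.nn.functional.normalize(e1, dim=-1)
    n = torch.nn.functional.normalize(normal, dim=-1)
    b = torch.cross(n, t, dim=-1)
    rotations = torch.stack([t, b, n], dim=-1)   # (F, 3, 3) rotation matrices
    return centers, rotations, scales

# Usage idea: for every animation frame predicted by DiffusionTalker, re-run the
# binding on the deformed FLAME vertices so the Gaussians move rigidly with their faces.
# verts = flame_model(expression, jaw_pose)       # hypothetical FLAME forward pass
# centers, rotations, scales = bind_gaussians_to_faces(verts, flame_faces)
```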