DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser

Peng Chen*1,2, Xiaobao Wei*1,2, Ming Lu3, Yitong Zhu1,2, Naiming Yao1, Xingyu Xiao4,
Hui Chen1
1Institute of Software, Chinese Academy of Sciences, 2University of Chinese Academy of Sciences, 3Intel Labs China, 4Tsinghua University

DiffusionTalker: We reduce the number of diffusion steps via knowledge distillation for faster inference. With the distilled few-step model, given an audio sequence, a matching identity embedding is retrieved from the identity library to personalize the speaker's talking style.

Abstract

Speech-driven 3D facial animation has been an attractive task in both academia and industry. Traditional methods mostly focus on learning a deterministic mapping from speech to animation. Recent approaches start to consider the non-deterministic nature of speech-driven 3D facial animation and employ diffusion models for the task. However, personalizing facial animation and accelerating animation generation remain two major limitations of existing diffusion-based methods. To address these limitations, we propose DiffusionTalker, a diffusion-based method that uses contrastive learning to personalize 3D facial animation and knowledge distillation to accelerate its generation. Specifically, to enable personalization, we introduce a learnable talking identity that aggregates knowledge from audio sequences. The proposed identity embeddings capture customized facial cues across different people in a contrastive learning manner. During inference, users can obtain personalized facial animation from input audio that reflects a specific talking style. We then distill a trained diffusion model with hundreds of steps into a lightweight model with 8 steps for acceleration. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code will be released.
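As an illustration of the contrastive personalization described above, here is a minimal PyTorch sketch of matching learnable identity embeddings to audio features with an InfoNCE-style loss. The names (IdentityLibrary, contrastive_identity_loss, match_identity), the embedding dimension, and the batch layout are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class IdentityLibrary(nn.Module):
    """Learnable identity embeddings, one per training speaker (hypothetical layout)."""
    def __init__(self, num_identities: int, dim: int = 256):
        super().__init__()
        self.embeddings = nn.Parameter(torch.randn(num_identities, dim) * 0.02)

def contrastive_identity_loss(audio_feat, identity_lib, speaker_ids, temperature=0.07):
    """InfoNCE-style loss: pull each audio feature toward its speaker's identity
    embedding and push it away from the other identities in the library."""
    a = F.normalize(audio_feat, dim=-1)        # (B, D) audio features
    e = F.normalize(identity_lib, dim=-1)      # (N, D) identity library
    logits = a @ e.t() / temperature           # (B, N) cosine similarities
    return F.cross_entropy(logits, speaker_ids)

def match_identity(audio_feat, identity_lib):
    """Inference: retrieve the library entry most similar to the input audio feature."""
    sims = F.normalize(audio_feat, dim=-1) @ F.normalize(identity_lib, dim=-1).t()
    return sims.argmax(dim=-1)                 # index of the matched identity embedding

At inference time, the retrieved embedding would condition the diffusion model together with the audio features to reproduce the matched talking style.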



Proposed Method

Figure: Overview of the DiffusionTalker framework.


During the denoising process, DiffusionTalker iteratively removes Gaussian noise from noise-corrupted facial animation while updating the model's parameters, so that facial animation is generated from the input speech. At each step, the model comprises two components: a personalization adapter, which uses contrastive learning to match speech and identity features, and the forward process, which adds noise to the facial parameters. All of this information is then fed into the talker decoder to predict the facial animation. DiffusionTalker is trained with knowledge distillation, which halves the number of sampling steps at each distillation round.
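The step-halving distillation can be sketched in the spirit of progressive distillation: a student learns to reproduce two DDIM steps of the teacher with a single step, and the round is repeated until only 8 steps remain. The sketch below assumes a denoiser that predicts clean facial parameters conditioned on audio and identity features; the function names, batch layout, and DDIM parameterization are assumptions, not the released implementation.

import torch
import torch.nn.functional as F

def ddim_step(model, x_t, t_now, t_next, audio, identity, alphas_cumprod):
    """One deterministic DDIM update from t_now to t_next, assuming the denoiser
    predicts the clean facial parameters x0 (hypothetical interface)."""
    a_now, a_next = alphas_cumprod[t_now], alphas_cumprod[t_next]
    x0_pred = model(x_t, t_now, audio, identity)
    eps = (x_t - a_now.sqrt() * x0_pred) / (1 - a_now).sqrt()
    return a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps

def distill_round(teacher, student, loader, alphas_cumprod, steps, optim):
    """One distillation round: the student matches two teacher DDIM steps with a
    single step, so its schedule has half as many steps (sketch only)."""
    timesteps = torch.linspace(len(alphas_cumprod) - 1, 0, steps + 1).long()
    for audio, identity, x0 in loader:                      # hypothetical batch layout
        i = 2 * torch.randint(0, steps // 2, (1,)).item()   # pick a pair of teacher steps
        t, t_mid, t_next = timesteps[i], timesteps[i + 1], timesteps[i + 2]
        noise = torch.randn_like(x0)
        x_t = alphas_cumprod[t].sqrt() * x0 + (1 - alphas_cumprod[t]).sqrt() * noise
        with torch.no_grad():                               # two teacher steps -> target
            x_mid = ddim_step(teacher, x_t, t, t_mid, audio, identity, alphas_cumprod)
            target = ddim_step(teacher, x_mid, t_mid, t_next, audio, identity, alphas_cumprod)
        pred = ddim_step(student, x_t, t, t_next, audio, identity, alphas_cumprod)
        loss = F.mse_loss(pred, target)
        optim.zero_grad(); loss.backward(); optim.step()

Repeating the round while halving the step count (e.g. 128 to 64 to ... to 8), with each new student initialized from the previous teacher, would yield the few-step model used at inference.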

BibTeX


      @article{chen2023diffusiontalker,
        title={DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser},
        author={Chen, Peng and Wei, Xiaobao and Lu, Ming and Zhu, Yitong and Yao, Naiming and Xiao, Xingyu and Chen, Hui},
        journal={arXiv preprint arXiv:2311.16565},
        year={2023}
      }