LawDNet: Enhanced Audio-Driven Lip Synthesis via Local Affine Warping Deformation

Junli Deng1†, Yihao Luo2†*, Xueting Yang3, Siyou Li4, Wei Wang5, Jinyang Guo6, Ping Shi1,
1Communication University of China, 2Imperial College London, 3HKU, 4Queen Mary University of London, 5Beijing University of Posts and Telecommunications, 6Beihang University, Beijing, China. *Corresponding Author, †Equal Contribution

We will release our paper and code upon acceptance. We extend our gratitude to Hao Wang for assisting us in data collection. Special thanks to Qipei Li and Qirong Liang for reviewing the format of our manuscript.


Abstract

In the domain of photorealistic avatar generation, the fidelity of audio-driven lip motion synthesis is essential for realistic virtual interactions. Existing methods face two key challenges: a lack of vivacity due to limited diversity in generated lip poses, and noticeable anamorphic motions caused by poor temporal coherence. To address these issues, we propose LawDNet, a novel deep-learning architecture that enhances lip synthesis through a Local Affine Warping Deformation mechanism. This mechanism models intricate lip movements in response to audio input via controllable non-linear warping fields. These fields consist of local affine transformations centered on abstract keypoints within deep feature maps, offering a new universal paradigm for feature warping in networks. Additionally, LawDNet incorporates a dual-stream discriminator for improved frame-to-frame continuity and employs face normalization techniques to handle pose and scene variations. Extensive evaluations demonstrate LawDNet's superior robustness and more dynamic lip movements compared to previous methods. The advancements presented in this paper, including the methodologies, training data, source codes, and pre-trained models, will be made accessible to the research community.
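To make the Local Affine Warping Deformation idea concrete, below is a minimal PyTorch sketch of one way such a warping field could be assembled: each abstract keypoint carries a local affine transform and a radius, the per-keypoint displacements are blended with Gaussian weights into a dense non-linear flow, and the deep feature map is resampled with that flow. The function name, Gaussian weighting, and tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def local_affine_warp(feat, keypoints, radii, affines, translations):
    """Warp a feature map with a field blended from local affine transforms.

    feat:         (B, C, H, W) deep feature map
    keypoints:    (B, K, 2)    abstract keypoint locations in [-1, 1], (x, y) order
    radii:        (B, K)       influence radius of each keypoint
    affines:      (B, K, 2, 2) local affine matrices
    translations: (B, K, 2)    local translations
    """
    B, C, H, W = feat.shape
    K = keypoints.shape[1]

    # Identity sampling grid in [-1, 1], shape (B, H, W, 2) with (x, y) order.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=feat.device),
        torch.linspace(-1, 1, W, device=feat.device),
        indexing="ij",
    )
    grid = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)

    # Offset of every grid point from every keypoint: (B, K, H, W, 2).
    diff = grid.unsqueeze(1) - keypoints.view(B, K, 1, 1, 2)

    # Per-keypoint displacement relative to identity: (A_k - I)(x - p_k) + t_k.
    disp = (
        torch.einsum("bkij,bkhwj->bkhwi", affines, diff)
        + translations.view(B, K, 1, 1, 2)
        - diff
    )

    # Gaussian weights centred on each keypoint, controlled by its radius.
    w = torch.exp(-diff.pow(2).sum(-1) / (2 * radii.view(B, K, 1, 1) ** 2))
    w = w / (w.sum(dim=1, keepdim=True) + 1e-6)  # normalise over keypoints

    # Blend the local affine fields into one non-linear warping field.
    flow = (w.unsqueeze(-1) * disp).sum(dim=1)  # (B, H, W, 2)

    # Resample the feature map along the deformed grid.
    return F.grid_sample(feat, grid + flow, align_corners=True)
```

Because every grid point is a radius-weighted blend of nearby affine transforms, the resulting field stays smooth yet non-linear, which is what allows localized lip deformations without tearing the surrounding feature map.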

Schematic of LawDNet

Method


Overview of the LawDNet framework. The data preprocessing stage prepares the model inputs through face frontalization and soft masking. Feature modulation is driven by audio and visual cues to learn keypoint locations, radii, and affine parameters, leading to the deformation of the coarse-grained grid. These deformations are then mapped onto feature maps, and the warped feature maps are processed by the generator \(G\) to produce accurate lip-synced images. Backward propagation leverages dual discriminators \(D_T\) and \(D_S\), alongside multi-level loss functions, to refine the training of \(G\).
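The caption above states that feature modulation learns keypoint locations, radii, and affine parameters from audio and visual cues. A minimal sketch of such a prediction head is shown below; the class name, layer sizes, fusion by concatenation, and activation choices are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WarpParamHead(nn.Module):
    """Toy modulation head: predicts keypoint locations, radii, and local
    affine parameters from fused audio-visual features (illustrative only)."""

    def __init__(self, audio_dim=256, visual_dim=256, num_keypoints=10, hidden=512):
        super().__init__()
        self.num_keypoints = num_keypoints
        # Each keypoint gets 2 (location) + 1 (radius) + 4 (affine) + 2 (translation) values.
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_keypoints * 9),
        )

    def forward(self, audio_feat, visual_feat):
        # audio_feat: (B, audio_dim), visual_feat: (B, visual_dim)
        out = self.mlp(torch.cat([audio_feat, visual_feat], dim=-1))
        out = out.view(-1, self.num_keypoints, 9)

        keypoints = torch.tanh(out[..., 0:2])               # locations in [-1, 1]
        radii = F.softplus(out[..., 2]) + 1e-2               # strictly positive radii
        affines = out[..., 3:7].reshape(-1, self.num_keypoints, 2, 2)
        affines = affines + torch.eye(2, device=out.device)  # bias toward identity
        translations = out[..., 7:9]
        return keypoints, radii, affines, translations
```

In the full pipeline described by the figure, parameters of this kind would drive the warping field applied to the coarse-grained grid and feature maps, and the warped features would then be passed to the generator \(G\) for training against \(D_T\), \(D_S\), and the multi-level losses.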

Demonstration Videos

LawDNet - Chinese

LawDNet - Chinese 2

LawDNet - English

LawDNet - Extreme Head Movement