LawDNet: Enhanced Audio-Driven Lip Synthesis via Local Affine Warping Deformation

Junli Deng1†, Yihao Luo2†*, Xueting Yang3, Siyou Li4, Wei Wang5, Jinyang Guo6, Ping Shi1,
1Communication University of China, 2Imperial College London, 3HKU, 4Queen Mary University of London, 5Beijing University of Posts and Telecommunications, 6Beihang University, Beijing, China. *Corresponding Author, †Equal Contribution

We will release our paper and code upon acceptance. We extend our gratitude to Hao Wang for assisting us in data collection. Special thanks to Qipei Li and Qirong Liang for reviewing the format of our manuscript.


Abstract

In the domain of photorealistic avatar generation, the fidelity of audio-driven lip motion synthesis is essential for realistic virtual interactions. Existing methods face two key challenges: a lack of vivacity due to limited diversity in generated lip poses, and noticeable anamorphic motions caused by poor temporal coherence. To address these issues, we propose LawDNet, a novel deep-learning architecture that enhances lip synthesis through a Local Affine Warping Deformation mechanism. This mechanism models intricate lip movements in response to audio input via controllable non-linear warping fields. These fields consist of local affine transformations centered on abstract keypoints within deep feature maps, offering a new universal paradigm for feature warping in networks. Additionally, LawDNet incorporates a dual-stream discriminator for improved frame-to-frame continuity and employs face normalization techniques to handle pose and scene variations. Extensive evaluations demonstrate LawDNet's superior robustness and more dynamic lip movements compared to previous methods. The advancements presented in this paper, including the methodologies, training data, source codes, and pre-trained models, will be made accessible to the research community.
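To make the Local Affine Warping Deformation idea concrete, below is a minimal PyTorch sketch of one way such a warping field could be assembled: each abstract keypoint carries a local affine transform and a radius, the per-keypoint displacements are blended with Gaussian weights into a dense non-linear flow, and the deep feature map is resampled with that flow. The function name, Gaussian weighting, and tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def local_affine_warp(feat, keypoints, radii, affines, translations):
    """Warp a feature map with a field blended from local affine transforms.

    feat:         (B, C, H, W) deep feature map
    keypoints:    (B, K, 2)    abstract keypoint locations in [-1, 1], (x, y) order
    radii:        (B, K)       influence radius of each keypoint
    affines:      (B, K, 2, 2) local affine matrices
    translations: (B, K, 2)    local translations
    """
    B, C, H, W = feat.shape
    K = keypoints.shape[1]

    # Identity sampling grid in [-1, 1], shape (B, H, W, 2) with (x, y) order.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=feat.device),
        torch.linspace(-1, 1, W, device=feat.device),
        indexing="ij",
    )
    grid = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)

    # Offset of every grid point from every keypoint: (B, K, H, W, 2).
    diff = grid.unsqueeze(1) - keypoints.view(B, K, 1, 1, 2)

    # Per-keypoint displacement relative to identity: (A_k - I)(x - p_k) + t_k.
    disp = (
        torch.einsum("bkij,bkhwj->bkhwi", affines, diff)
        + translations.view(B, K, 1, 1, 2)
        - diff
    )

    # Gaussian weights centred on each keypoint, controlled by its radius.
    w = torch.exp(-diff.pow(2).sum(-1) / (2 * radii.view(B, K, 1, 1) ** 2))
    w = w / (w.sum(dim=1, keepdim=True) + 1e-6)  # normalise over keypoints

    # Blend the local affine fields into one non-linear warping field.
    flow = (w.unsqueeze(-1) * disp).sum(dim=1)  # (B, H, W, 2)

    # Resample the feature map along the deformed grid.
    return F.grid_sample(feat, grid + flow, align_corners=True)
```

Because every grid point is a radius-weighted blend of nearby affine transforms, the resulting field stays smooth yet non-linear, which is what allows localized lip deformations without tearing the surrounding feature map.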

Schematic of LawDNet

Method


Overview of the LawDNet framework. The data preprocessing stage prepares the model inputs through face frontalization and soft masking. Feature modulation is driven by audio and visual cues to learn keypoint locations, radii, and affine parameters, leading to the deformation of the coarse-grained grid. These deformations are then mapped onto feature maps, and the warped feature maps are processed by the generator \(G\) to produce accurate lip-synced images. Backward propagation leverages dual discriminators \(D_T\) and \(D_S\), alongside multi-level loss functions, to refine the training of \(G\).
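The caption above states that feature modulation learns keypoint locations, radii, and affine parameters from audio and visual cues. A minimal sketch of such a prediction head is shown below; the class name, layer sizes, fusion by concatenation, and activation choices are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WarpParamHead(nn.Module):
    """Toy modulation head: predicts keypoint locations, radii, and local
    affine parameters from fused audio-visual features (illustrative only)."""

    def __init__(self, audio_dim=256, visual_dim=256, num_keypoints=10, hidden=512):
        super().__init__()
        self.num_keypoints = num_keypoints
        # Each keypoint gets 2 (location) + 1 (radius) + 4 (affine) + 2 (translation) values.
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_keypoints * 9),
        )

    def forward(self, audio_feat, visual_feat):
        # audio_feat: (B, audio_dim), visual_feat: (B, visual_dim)
        out = self.mlp(torch.cat([audio_feat, visual_feat], dim=-1))
        out = out.view(-1, self.num_keypoints, 9)

        keypoints = torch.tanh(out[..., 0:2])               # locations in [-1, 1]
        radii = F.softplus(out[..., 2]) + 1e-2               # strictly positive radii
        affines = out[..., 3:7].reshape(-1, self.num_keypoints, 2, 2)
        affines = affines + torch.eye(2, device=out.device)  # bias toward identity
        translations = out[..., 7:9]
        return keypoints, radii, affines, translations
```

In the full pipeline described by the figure, parameters of this kind would drive the warping field applied to the coarse-grained grid and feature maps, and the warped features would then be passed to the generator \(G\) for training against \(D_T\), \(D_S\), and the multi-level losses.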

Demonstration Videos

LawDNet - Chinese

LawDNet - Chinese 2

LawDNet - English

LawDNet - Extreme Head Movement