Abstract. In this paper, we consider the following task: given an arbitrary
speech audio clip and one lip image of an arbitrary target identity, generate synthesized lip movements of the target identity saying the speech. To perform
well, a model must not only account for retention of the target identity,
photo-realism of the synthesized images, and consistency and smoothness of the lip
images in a sequence, but, more importantly, learn the correlation between the speech audio and the lip movements. To address these problems jointly,
we devise a network to synthesize lip movements and propose a novel
correlation loss to synchronize lip changes and speech changes. Our full
model combines four losses for a comprehensive treatment; it is trained
end-to-end and is robust to lip shapes, view angles, and different facial
characteristics. Extensive experiments on three datasets ranging from
lab-recorded footage to lips in the wild show that our model significantly outperforms other state-of-the-art methods extended to this task.
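The abstract does not define the correlation loss, so the following is only an illustrative sketch of one plausible variant, not the paper's actual formulation: it penalizes low Pearson correlation between frame-to-frame change magnitudes of audio features and lip-image features. The function name `correlation_loss`, the choice of change magnitudes as the "lip change" and "speech change" signals, and the feature shapes are all assumptions introduced here for illustration.

```python
import numpy as np

def correlation_loss(audio_feats, lip_feats, eps=1e-8):
    """Hypothetical correlation-style loss (illustrative only).

    audio_feats, lip_feats: arrays of shape (T, D), one feature vector per
    frame, with the same number of frames T.
    Returns 1 - Pearson correlation between the per-frame change magnitudes
    of the two sequences, so the loss is small when audio changes and lip
    changes rise and fall together.
    """
    # Frame-to-frame change magnitudes: a simple proxy for "speech change"
    # and "lip change" over time.
    da = np.linalg.norm(np.diff(audio_feats, axis=0), axis=1)
    dl = np.linalg.norm(np.diff(lip_feats, axis=0), axis=1)
    # Standardize each change series before correlating.
    da = (da - da.mean()) / (da.std() + eps)
    dl = (dl - dl.mean()) / (dl.std() + eps)
    corr = np.mean(da * dl)  # Pearson correlation of the two series
    return 1.0 - corr
```

Under this sketch, a lip sequence whose changes perfectly track the audio changes yields a loss near 0, while uncorrelated sequences yield a loss near 1; in a full training setup the same quantity would be computed on learned network features and minimized jointly with the other losses.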