Abstract
Video generation and manipulation are important yet challenging tasks in computer vision. Existing methods usually lack explicit control over the synthesized motion.
In this work, we present a conditional video generation
model that allows detailed control over the motion of the
generated video. Given the first frame and sparse motion
trajectories specified by users, our model can synthesize
a video with corresponding appearance and motion. We
propose to combine the advantages of copying pixels from the given frame and hallucinating the lightness difference from scratch, which helps generate sharp videos while keeping the model robust to occlusion and lightness changes. We also propose a training paradigm that calculates trajectories directly from video clips, eliminating the need for annotated training data.
training data. Experiments on several standard benchmarks
demonstrate that our approach can generate realistic videos
comparable to state-of-the-art video generation and video
prediction methods while the motion of the generated videos
can correspond well with user input