Abstract. Current video generation/prediction/completion results are
limited, due to the severe ill-posedness inherent in these three problems.
In this paper, we focus on human action videos, and propose a general,
two-stage deep framework to generate human action videos with no constraints or an arbitrary number of constraints, which uniformly addresses
the three problems: video generation given no input frames, video prediction given the first few frames, and video completion given the first and
last frames. To solve video generation from scratch, we build a two-stage
framework where we first train a deep generative model that generates
human pose sequences from random noise, and then train a skeleton-to-image network to synthesize human action videos given the human pose
sequences generated. To solve video prediction and completion, we exploit our trained model and conduct optimization over the latent space to
generate videos that best suit the given input frame constraints. With our
novel method, we sidestep the original ill-posed problems and produce
for the first time high-quality video generation/prediction/completion
results of much longer duration. We present quantitative and qualitative evaluations to show that our approach outperforms state-of-the-art
methods in all three tasks.
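
As an illustration of the latent-space optimization idea used for prediction and completion, the following is a minimal sketch and not the implementation described in this paper. It assumes a fixed random linear map as a stand-in for the trained pose-sequence generator, and the names generate, constraint_loss, fit_latent and the sizes T, P, D are hypothetical. The sketch searches for a latent code whose generated sequence matches the given frames at the constrained indices.

```python
import numpy as np


def generate(weights, z):
    """Toy stand-in generator (assumption): latent code -> (T, P) pose sequence."""
    return weights @ z  # (T, P, D) @ (D,) -> (T, P)


def constraint_loss(weights, z, frame_idx, target):
    """Half the squared error between generated and given frames."""
    residual = generate(weights, z)[frame_idx] - target
    return 0.5 * float(np.sum(residual ** 2))


def fit_latent(weights, frame_idx, target, z0, steps=500):
    """Gradient descent over z so that the constrained frames match the target."""
    # Flatten the constrained rows into one matrix m, so m @ z gives those frames.
    m = weights[frame_idx].reshape(-1, weights.shape[-1])
    y = target.reshape(-1)
    # Step size 1/L, where L is the largest eigenvalue of m^T m.
    step = 1.0 / np.linalg.norm(m, 2) ** 2
    z = z0.copy()
    for _ in range(steps):
        z -= step * (m.T @ (m @ z - y))  # gradient of 0.5 * ||m z - y||^2
    return z


rng = np.random.default_rng(0)
T, P, D = 16, 2, 8                    # frames, pose dimensions, latent size (hypothetical)
weights = rng.standard_normal((T, P, D))
frame_idx = np.array([0, T - 1])      # completion setting: first and last frames given
target = generate(weights, rng.standard_normal(D))[frame_idx]

z0 = rng.standard_normal(D)           # random starting latent code
z_fit = fit_latent(weights, frame_idx, target, z0)
completed = generate(weights, z_fit)  # full sequence fitted to the given frames
```

In this toy setting, setting frame_idx to the first few indices corresponds to prediction, and sampling z without any fitting corresponds to generation from scratch. With a deep generator, the closed-form gradient would be replaced by automatic differentiation.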