Abstract. Existing video prediction methods mainly rely on observing
multiple historical frames or focus on predicting only the next frame. In
this work, we study the problem of generating multiple consecutive future
frames from a single still image. We formulate the
multi-frame prediction task as a multiple-time-step flow (multi-flow)
prediction phase followed by a flow-to-frame synthesis phase. The multi-flow
prediction is modeled in a variational probabilistic manner, with
spatial-temporal relationships learned through 3D convolutions. The
flow-to-frame synthesis is modeled as a generative process in order to keep
the predicted results closer to the manifold of real video sequences.
Such a two-phase design prevents the model from operating directly in the
high-dimensional pixel space of the frame sequence and is demonstrated
to produce better and more diverse results. Extensive
experimental results on videos with different types of motion show that
the proposed algorithm performs favorably against existing methods in
terms of quality, diversity and human perceptual evaluation.