Abstract
We describe a modular framework for video frame prediction, which we call the Flexible Spatio-Temporal Network (FSTN). It supports both the extrapolation of a video sequence and the estimation of synthetic frames lying between observed frames, and thus the generation of slow-motion videos. By devising a customized objective function comprising decoding, encoding, and adversarial losses, we mitigate the common problem of blurry predictions and retain high-frequency information even for relatively distant future predictions. We propose and analyse different training strategies to optimize our model. Extensive experiments on several challenging public datasets demonstrate both the versatility and validity of our model.