Abstract. Anticipating future events is an important prerequisite for intelligent behavior. Video forecasting has been studied as a proxy
task for this goal. Recent work has shown that to predict semantic
segmentation of future frames, forecasting at the semantic level is more
effective than forecasting RGB frames and then segmenting them. In this
paper we consider the more challenging problem of future instance segmentation, which additionally segments out individual objects. To deal
with a varying number of output labels per image, we develop a predictive model in the space of fixed-sized convolutional features of the
Mask R-CNN instance segmentation model. We apply the “detection
head” of Mask R-CNN on the predicted features to produce the instance
segmentation of future frames. Experiments show that this approach
significantly improves over strong baselines based on optical flow and
repurposed instance segmentation architectures.
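
To make the feature-level forecasting idea concrete, the following is a minimal sketch, not the model used in the paper. It assumes a single 1x1 linear map over the stacked channels of the past feature maps as a stand-in for the learned predictor, and treats the detection head as an arbitrary callable. The names predict_next_features, rollout, segment_future, weights, bias, and head are hypothetical and chosen only for illustration.

```python
import numpy as np


def predict_next_features(past, weights, bias):
    """Predict one future feature map from T past maps.

    past:    array of shape (T, C, H, W), fixed-size convolutional features
    weights: array of shape (C, T * C), a 1x1 convolution kernel (assumption)
    bias:    array of shape (C,)
    Returns an array of shape (C, H, W).
    """
    t, c, h, w = past.shape
    # Stack the T past maps along the channel axis: (T * C, H, W).
    stacked = past.reshape(t * c, h, w)
    # A 1x1 convolution is a per-pixel linear map over channels.
    out = np.einsum("oc,chw->ohw", weights, stacked)
    return out + bias[:, None, None]


def rollout(past, weights, bias, steps):
    """Autoregressively predict `steps` future feature maps."""
    frames = past.copy()
    preds = []
    for _ in range(steps):
        nxt = predict_next_features(frames, weights, bias)
        preds.append(nxt)
        # Slide the window: drop the oldest map, append the prediction.
        frames = np.concatenate([frames[1:], nxt[None]], axis=0)
    return np.stack(preds)


def segment_future(past, weights, bias, steps, head):
    """Apply a detection head (hypothetical callable) to each predicted map."""
    return [head(f) for f in rollout(past, weights, bias, steps)]
```

Because the predicted features have the same fixed shape as the inputs, the head can return any number of instances per frame without changing the predictor, which is the property the abstract relies on to handle a varying number of output labels.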