FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos
Abstract
We propose an end-to-end learning framework for segmenting generic objects in videos. Our method learns to combine appearance and motion information to produce pixel-level segmentation masks for all prominent objects. We formulate the task as a structured prediction problem and design a two-stream fully convolutional neural network that fuses motion and appearance in a unified framework. Since large-scale video datasets with pixel-level segmentations are lacking, we show how to bootstrap weakly annotated videos together with existing image recognition datasets for training. Through experiments on three challenging video segmentation benchmarks, our method substantially improves over the state of the art for segmenting generic (unseen) objects. Code and pretrained models are available on the project website.
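To make the two-stream idea concrete, the following is a minimal PyTorch sketch, not the paper's actual architecture: the class name, layer sizes, and the choice of a 1x1 convolution for fusion are illustrative assumptions. One stream processes the RGB frame (appearance), the other an optical-flow image (motion), and their per-pixel object/background logits are fused into a single prediction.

    import torch
    import torch.nn as nn

    class TwoStreamFusionSeg(nn.Module):
        """Hypothetical two-stream fusion network: an appearance stream
        takes an RGB frame, a motion stream takes optical flow rendered
        as a 3-channel image; per-pixel logits are fused by a 1x1 conv."""

        def __init__(self):
            super().__init__()
            self.appearance = self._make_stream(in_channels=3)
            self.motion = self._make_stream(in_channels=3)
            # Learned per-pixel combination of the two streams' logits
            self.fuse = nn.Conv2d(4, 2, kernel_size=1)

        @staticmethod
        def _make_stream(in_channels):
            # Stand-in for a fully convolutional backbone; the real
            # model would use a much deeper pretrained network.
            return nn.Sequential(
                nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 2, kernel_size=1),  # object/background logits
            )

        def forward(self, frame, flow):
            a = self.appearance(frame)  # (N, 2, H, W) appearance logits
            m = self.motion(flow)       # (N, 2, H, W) motion logits
            return self.fuse(torch.cat([a, m], dim=1))  # fused logits

    # Usage: one RGB frame and its optical flow encoded as an image
    net = TwoStreamFusionSeg()
    frame = torch.randn(1, 3, 224, 224)
    flow = torch.randn(1, 3, 224, 224)
    mask_logits = net(frame, flow)  # (1, 2, 224, 224) per-pixel scores

Thresholding or arg-maxing the fused logits per pixel yields the binary segmentation mask; the 1x1 fusion layer is one simple way to let the network learn how much to trust each stream at every pixel.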