Abstract
This paper presents a novel yet intuitive approach to unsupervised feature learning. Inspired by the human visual
system, we explore whether low-level motion-based grouping cues can be used to learn an effective visual representation. Specifically, we use unsupervised motion-based segmentation on videos to obtain segments, which we use as
‘pseudo ground truth’ to train a convolutional network to
segment objects from a single frame. Given the extensive
evidence that motion plays a key role in the development of
the human visual system, we hope that this straightforward
approach to unsupervised learning will be more effective
than cleverly designed ‘pretext’ tasks studied in the literature. Indeed, our extensive experiments show that this is the
case. When used for transfer learning on object detection,
our representation significantly outperforms previous unsupervised approaches across multiple settings, especially
when training data for the target task is scarce.
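
To make the training setup concrete, the following is a minimal sketch of the idea described above, not the authors' implementation: pseudo ground-truth masks are produced by an unsupervised motion-based segmenter on video, and a convolutional network is trained to predict those masks from a single frame. The `motion_segment` helper, the toy `SingleFrameSegmenter` architecture, and the dummy frame data are all illustrative assumptions.

```python
# Sketch: train a ConvNet to segment objects in a single frame, supervised
# only by pseudo ground-truth masks from motion-based grouping on video.
import torch
import torch.nn as nn

def motion_segment(frame_t, frame_t1):
    """Hypothetical stand-in for unsupervised motion-based segmentation.
    A real pipeline would group pixels by optical flow; here we simply
    threshold frame differences to get a dummy binary mask."""
    return (frame_t - frame_t1).abs().mean(dim=1, keepdim=True) > 0.1

class SingleFrameSegmenter(nn.Module):
    """Toy encoder-decoder that maps one RGB frame to per-pixel
    foreground/background logits (the paper uses a larger backbone)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)

model = SingleFrameSegmenter()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
criterion = nn.BCEWithLogitsLoss()

# Dummy batch of consecutive video frames, shape (B, 3, H, W).
frames_t = torch.rand(4, 3, 64, 64)
frames_t1 = frames_t + 0.05 * torch.randn_like(frames_t)

for step in range(10):
    with torch.no_grad():
        # Pseudo ground truth comes from motion, with no human labels.
        pseudo_masks = motion_segment(frames_t, frames_t1).float()
    logits = model(frames_t)                # predict the mask from one frame
    loss = criterion(logits, pseudo_masks)  # supervise with the pseudo labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

After such training, the learned convolutional features (rather than the segmentation head) would be what gets transferred to a target task such as object detection.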