Abstract
We present an unsupervised approach that generates a diverse, ranked set of bounding box and segmentation video object proposals (spatio-temporal tubes that localize the foreground objects) in an unannotated video. In contrast to previous unsupervised methods that either track regions initialized in an arbitrary frame or train a fixed model over a cluster of regions, we instead discover a set of easy-to-group instances of an object and then iteratively update its appearance model to gradually detect harder instances in temporally adjacent frames. Our method first generates a set of spatio-temporal bounding box proposals, and then refines them to obtain pixel-wise segmentation proposals. We demonstrate state-of-the-art segmentation results on the SegTrack v2 dataset, and bounding box tracking results that are competitive with state-of-the-art supervised tracking methods.