Abstract
Unsupervised learning requires a grouping step that defines which data belong together. A natural way of grouping in images is the segmentation of objects or parts of objects. While pure bottom-up segmentation from static cues is well known to be ambiguous at the object level, the story changes as soon as objects move. In this paper, we present a method that uses long-term point trajectories based on dense optical flow. Defining pairwise distances between these trajectories allows us to cluster them, which results in temporally consistent segmentations of moving objects in a video shot. In contrast to multi-body factorization, points and even whole objects may appear or disappear during the shot. We provide a benchmark dataset and an evaluation method for this so far unexplored setting.