Abstract
We address the problem of segmenting multiple object instances in complex videos. Our method does not require manual pixel-level annotation for training, and relies instead only on readily available object detectors or visual object tracking. Given object bounding boxes as input, we cast video segmentation as a weakly-supervised learning problem. Our proposed objective combines (a) a discriminative clustering term for background segmentation, (b) a spectral clustering term for grouping pixels of the same object instance, and (c) linear constraints enabling instance-level segmentation. We propose a convex relaxation of this problem and solve it efficiently using the Frank-Wolfe algorithm. We report results and compare our method to several baselines on a new video dataset for multi-instance person segmentation.
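To make the optimization step mentioned above more concrete, the following is a minimal, generic sketch of the Frank-Wolfe iteration. It is not the objective or constraint set of this paper: it assumes a toy convex quadratic f(x) = 0.5 x^T Q x - b^T x minimized over the probability simplex, and the function name `frank_wolfe_simplex`, the matrix `Q`, the vector `b`, and the step-size rule 2/(k+2) are all illustrative assumptions. The point is only that each iteration needs a gradient and the solution of a linear problem over the feasible set, with no projection.

```python
import numpy as np


def frank_wolfe_simplex(Q, b, n_iters=2000):
    """Minimize 0.5 * x^T Q x - b^T x over the probability simplex.

    Hypothetical toy problem for illustration only; Q is assumed
    symmetric positive semidefinite so that the objective is convex.
    """
    n = len(b)
    x = np.full(n, 1.0 / n)  # feasible start: uniform point of the simplex
    for k in range(n_iters):
        grad = Q @ x - b
        # Linear minimization oracle: a linear function over the simplex
        # attains its minimum at a vertex, here the coordinate with the
        # smallest gradient entry.
        s = np.zeros(n)
        s[np.argmin(grad)] = 1.0
        # Standard diminishing step size; the update is a convex
        # combination, so the iterate stays feasible without projection.
        gamma = 2.0 / (k + 2.0)
        x = (1.0 - gamma) * x + gamma * s
    return x


if __name__ == "__main__":
    Q = np.eye(3)
    b = np.array([0.5, 0.3, 0.2])
    print(frank_wolfe_simplex(Q, b))
```

With `Q` the identity and `b` already inside the simplex, the minimizer is `b` itself, so the printed iterate should approach `b` as the iteration count grows. In a segmentation setting the feasible set would instead be defined by the relaxed assignment constraints, and the linear oracle would change accordingly.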