Abstract
The status quo approach to training object detectors requires expensive bounding box annotations. Our framework takes a markedly different direction: we transfer tracked object boxes from weakly-labeled videos to weakly-labeled images to automatically generate pseudo ground-truth boxes, which replace manually annotated bounding boxes. We first mine discriminative regions in the weakly-labeled image collection that appear frequently in the positive images and rarely in the negative images. We then match those regions to videos and retrieve the corresponding tracked object boxes. Finally, we design a Hough transform algorithm to vote for the best box to serve as the pseudo ground-truth (GT) box for each image, and use these boxes to train an object detector. Together, these steps lead to state-of-the-art weakly-supervised detection results on the PASCAL 2007 and 2010 datasets.
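
The box-transfer and voting steps can be illustrated with a minimal sketch. It is not the voting scheme used in this work, which the abstract does not specify. It assumes that each match pairs an image region with a video region, and that the tracked box is carried into image coordinates by the scale and offset between the two regions. The names `transfer_box` and `vote_pseudo_gt`, the `(x1, y1, x2, y2)` box layout, the positive match scores, and the `bin_size` parameter are all hypothetical.

```python
from collections import defaultdict


def transfer_box(img_region, vid_region, vid_box):
    """Map a tracked video box into image coordinates.

    All boxes are (x1, y1, x2, y2). The mapping is the scale and
    translation that carries vid_region onto img_region (an assumption).
    """
    sx = (img_region[2] - img_region[0]) / (vid_region[2] - vid_region[0])
    sy = (img_region[3] - img_region[1]) / (vid_region[3] - vid_region[1])
    x1 = img_region[0] + (vid_box[0] - vid_region[0]) * sx
    y1 = img_region[1] + (vid_box[1] - vid_region[1]) * sy
    x2 = img_region[0] + (vid_box[2] - vid_region[0]) * sx
    y2 = img_region[1] + (vid_box[3] - vid_region[1]) * sy
    return (x1, y1, x2, y2)


def vote_pseudo_gt(matches, bin_size=20.0):
    """Pick a pseudo ground-truth box by Hough-style voting.

    matches: list of (img_region, vid_region, vid_box, score) tuples,
    with score > 0. Each transferred box votes, weighted by its score,
    into a quantized cell of box-parameter space. The result is the
    score-weighted mean of the boxes in the heaviest cell, or None if
    there are no matches.
    """
    votes = defaultdict(list)
    for img_region, vid_region, vid_box, score in matches:
        box = transfer_box(img_region, vid_region, vid_box)
        # Quantize the four coordinates to get the accumulator cell.
        key = tuple(int(c // bin_size) for c in box)
        votes[key].append((box, score))
    if not votes:
        return None
    best = max(votes.values(), key=lambda cell: sum(s for _, s in cell))
    total = sum(s for _, s in best)
    return tuple(sum(b[i] * s for b, s in best) / total for i in range(4))


if __name__ == "__main__":
    matches = [
        ((10, 10, 30, 30), (0, 0, 20, 20), (-5, -5, 45, 45), 1.0),
        ((40, 40, 50, 50), (100, 100, 120, 120), (32, 32, 128, 128), 1.0),
        ((0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10), 0.5),  # outlier
    ]
    print(vote_pseudo_gt(matches))
```

In this toy input, the first two matches transfer to nearly the same box and fall into the same accumulator cell, so together they outvote the third. Hard binning can split votes that lie near a cell boundary; a practical version would likely smooth the accumulator or cluster boxes by overlap instead.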