Abstract
Despite remarkable progress in recent years, detecting objects in a new context remains a challenging task. Detectors learned from a public dataset work only with
a fixed list of categories, while training from scratch usually requires a large amount of training data with detailed
annotations. This work explores a novel approach:
learning object detectors from documentary films in a
weakly supervised manner. This is inspired by the observation that documentaries often provide dedicated exposition of certain object categories, where visual presentations
are aligned with subtitles. We believe that object detectors can be learned from such a rich source of information. Towards this goal, we develop a joint probabilistic
framework, where individual pieces of information, including video frames and subtitles, are brought together via
both visual and linguistic links. On top of this formulation,
we further derive a weakly supervised learning algorithm,
where object model learning and training set mining are
unified in an optimization procedure. Experimental results
on a real-world dataset demonstrate that this is an effective
approach to learning new object detectors.