Abstract
Visual recognition requires learning object models from training data. Commonly, training samples are annotated by marking only the bounding box of objects, since this appears to be the best trade-off between labeling information and effectiveness. However, objects are typically not box-shaped. Thus, the usual parametrization of object hypotheses by only their location, scale, and aspect ratio seems inappropriate, since the box contains a significant amount of background clutter. Most importantly, however, object shape becomes explicit only once objects are segregated from the background. Segmentation is an ill-posed problem, and so we propose an approach for learning object models for detection while, simultaneously, learning to segregate objects from clutter and extracting their overall shape. For this purpose, we exclusively use bounding-box annotated training data. The approach groups fragmented object regions using the Multiple Instance Learning (MIL) framework to obtain a meaningful representation of object shape which, at the same time, crops away distracting background clutter to improve the appearance representation.
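To make the MIL framing concrete, the following is a minimal sketch, not the paper's implementation: each bounding box is treated as a "bag" of candidate region fragments, and a bag is scored as positive if at least one of its instances looks like the object. The feature vectors, the linear scorer, and all function names below are hypothetical illustrations.

```python
import numpy as np

def score_instances(w: np.ndarray, instances: np.ndarray) -> np.ndarray:
    """Hypothetical linear per-instance scores for the fragments in one bag."""
    return instances @ w

def bag_score(w: np.ndarray, instances: np.ndarray) -> float:
    """MIL aggregation: a bag is as positive as its best instance (max rule)."""
    return float(np.max(score_instances(w, instances)))

# Toy example: one bounding box yields a bag of three region fragments,
# each described by a 2-D feature vector (purely illustrative data).
rng = np.random.default_rng(0)
w = np.array([1.0, -0.5])        # assumed instance-level classifier weights
bag = rng.normal(size=(3, 2))    # 3 fragments, 2 features each
print(bag_score(w, bag))         # positive score would flag the box as object
```

Under this view, fragments that never contribute to positive bag scores can be treated as background clutter and cropped away, which is the intuition behind the joint detection-and-segregation learning described above.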