Abstract
To avoid an expensive manual labelling process or to learn object classes autonomously without human intervention, object discovery techniques have been proposed that extract visually similar objects from weakly labelled videos. However, the problem of discovering small or medium sized objects is largely unexplored. We observe that videos with activities involving human-object interactions can serve as weakly labelled data for such cases. Since neither object appearance nor motion is distinct enough to discover objects in such videos, we propose a framework that samples from a space of algorithms and their parameters to extract sequences of object proposals. Furthermore, we model similarity of objects based on appearance and functionality, which is derived from human and object motion. We show that functionality is an important cue for discovering objects from activities and demonstrate the generality of the model on three challenging RGB-D and RGB datasets.