Abstract
We propose a novel method for removing irrelevant frames from a video, given user-provided frame-level labels for a very small number of frames. We first hypothesize a number of candidate areas that may contain the object of interest, and then determine which area(s) truly contain it. Our method has several favorable properties. First, compared to approaches that use a single descriptor to describe a whole frame, each area's feature descriptor has the chance of genuinely describing the object of interest, and hence is less affected by background clutter. Second, by considering the temporal continuity of a video instead of treating the frames as independent, we can hypothesize the locations of the candidate areas more accurately. Third, by infusing prior knowledge into the topic-motion model, we can precisely follow the trajectory of the object of interest. This allows us to substantially reduce the number of candidate areas and hence the chance of overfitting the data during learning. We demonstrate the effectiveness of the method by comparing it to several other semi-supervised learning approaches on challenging video clips.