Abstract
Understanding the set of possible human actions, which is at once highly diverse and intricately fine-grained, is a
critical open problem in computer vision. Manually labeling training videos is feasible for some action classes but
does not scale to the full long-tailed distribution of actions. A
promising way to address this is to leverage noisy data from
web queries to learn new actions, using semi-supervised or
“webly-supervised” approaches. However, these methods
typically either do not learn domain-specific knowledge or rely on
iterative, hand-tuned data labeling policies. In this work, we
instead propose a reinforcement learning-based formulation for selecting the right examples for training a classifier
from noisy web search results. Our method uses Q-learning
to learn a data labeling policy on a small labeled training
dataset, and then uses this policy to automatically label noisy web
data for new visual concepts. Experiments on the challenging Sports-1M action recognition benchmark as well as on
additional fine-grained and newly emerging action classes
demonstrate that our method learns effective labeling
policies for noisy data and uses them to train accurate visual
concept classifiers.