Abstract
Popular deep models for action recognition in videos
generate independent predictions for short clips, which are
then pooled heuristically to assign an action label to the full
video segment. As not all frames may characterize the underlying action (indeed, many are common across multiple actions), pooling schemes that impose equal importance on all frames can be suboptimal. To tackle this problem, we propose discriminative pooling, based on the
notion that among the deep features generated on all short
clips, there is at least one that characterizes the action. To
this end, we learn a (nonlinear) hyperplane that separates
this unknown, yet discriminative, feature from the rest. Applying multiple instance learning in a large-margin setup,
we use the parameters of this separating hyperplane as a
descriptor for the full video segment. Since these parameters are directly related to the support vectors in a max-margin framework, they serve as robust representations for
pooling of the features. We formulate a joint objective and
an efficient solver that learns these hyperplanes per video
and the corresponding action classifiers over the hyperplanes. Our pooling scheme is end-to-end trainable within
a deep framework. We report results from experiments on
three benchmark datasets spanning a variety of challenges
and demonstrate state-of-the-art performance across these
tasks.
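
To make the core idea concrete, the following is a minimal sketch (not the paper's implementation): it treats a video's clip features as a positive bag, fits a max-margin hyperplane separating them from generic negative features, and returns the hyperplane parameters as the pooled video descriptor. The function name and the use of scikit-learn's LinearSVC as an off-the-shelf max-margin solver are assumptions for illustration; the paper instead learns these hyperplanes jointly with the action classifiers, end-to-end within a deep network.

```python
import numpy as np
from sklearn.svm import LinearSVC

def discriminative_pool(clip_features, background_features, C=1.0):
    """Pool per-clip deep features into a single video descriptor.

    clip_features:       (n_clips, d) features of one video (positive bag;
                         at least one clip is assumed to characterize the action)
    background_features: (n_bg, d) generic negative features
    Returns the parameters (w, b) of the separating hyperplane,
    concatenated into a (d + 1,) descriptor.
    """
    X = np.vstack([clip_features, background_features])
    y = np.concatenate([np.ones(len(clip_features)),
                        -np.ones(len(background_features))])
    svm = LinearSVC(C=C).fit(X, y)
    # The learned hyperplane separates the discriminative clip features
    # from the rest; its parameters serve as the pooled representation.
    return np.concatenate([svm.coef_.ravel(), svm.intercept_])

# Usage on random stand-in features:
rng = np.random.default_rng(0)
descriptor = discriminative_pool(rng.normal(size=(30, 512)),
                                 rng.normal(size=(200, 512)))
print(descriptor.shape)  # (513,)
```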