Abstract. This paper introduces a large-scale, multi-label, multi-task video dataset named Scenes-Objects-Actions (SOA). Most prior
video datasets are based on a predefined taxonomy, which is used to define
the keyword queries issued to search engines. The videos retrieved
by the search engines are then verified for correctness by human annotators. Datasets collected in this manner tend to yield high classification
accuracy because search engines typically rank “easy” videos first. The SOA
dataset adopts a different approach. We rely on uniform sampling to get a
better representation of videos on the Web. Trained annotators are asked
to provide free-form text labels describing each video in three different
aspects: scene, object and action. These raw labels are then merged, split
and renamed to generate a taxonomy for SOA. All the annotations are
verified again based on the taxonomy. The final dataset includes 562K
videos with 3.64M annotations spanning 49 categories for scenes, 356 for
objects, and 148 for actions, and naturally captures the long-tail distribution
of visual concepts in the real world. We show that datasets collected
in this way are quite challenging by evaluating existing popular video
models on SOA. We provide an in-depth analysis of the performance
of different models on SOA, and highlight potential new directions in
video classification. We compare SOA with existing datasets and discuss
various factors that impact the performance of transfer learning. A key feature of SOA is that it enables the empirical study of correlation among
scene, object and action recognition in video. We present results of this
study and further analyze the potential of using the information learned
from one task to improve the others. We also demonstrate different ways
of scaling up SOA to learn better features. We believe that the challenges presented by SOA offer the opportunity for further advancement
in video analysis as we progress from single-label classification towards
a more comprehensive understanding of video data.