Abstract. This paper introduces a large-scale, multi-label, multi-task video dataset named Scenes-Objects-Actions (SOA). Most prior
video datasets are based on a predefined taxonomy, which is used to define
the keyword queries issued to search engines. The videos retrieved
by the search engines are then verified for correctness by human annotators. Datasets collected in this manner tend to yield high classification
accuracy because search engines typically rank “easy” videos first. The SOA
dataset adopts a different approach. We rely on uniform sampling to get a
better representation of videos on the Web. Trained annotators are asked
to provide free-form text labels describing each video in three different
aspects: scene, object and action. These raw labels are then merged, split
and renamed to generate a taxonomy for SOA. All the annotations are
verified again based on the taxonomy. The final dataset includes 562K
videos with 3.64M annotations spanning 49 categories for scenes, 356 for
objects, and 148 for actions, and naturally captures the long-tail distribution
of visual concepts in the real world. We show that datasets collected
in this way are quite challenging by evaluating existing popular video
models on SOA. We provide an in-depth analysis of the performance
of different models on SOA, and highlight potential new directions in
video classification. We compare SOA with existing datasets and discuss
various factors that impact the performance of transfer learning. A key feature of SOA is that it enables the empirical study of correlation among
scene, object and action recognition in video. We present results of this
study and further analyze the potential of using the information learned
from one task to improve the others. We also demonstrate different ways
of scaling up SOA to learn better features. We believe that the challenges presented by SOA offer the opportunity for further advancement
in video analysis as we progress from single-label classification towards
a more comprehensive understanding of video data.