We provide a dataset with 12 classes of human actions and 10 classes of scenes distributed over 3669 video clips and approximately 20.1 hours of video in total. The dataset intends to provide a comprehensive benchmark for human action recognition in realistic and challenging settings. The dataset is composed of video clips from 69 movies (see the list of movies below). A part of this dataset was originally used in the paper "Actions in Context", Marszałek et al. in Proc. CVPR'09.
Action samples were collected by means of automatic script-to-video alignment in combination with text-based script classification following Laptev at al. CVPR'08. Video samples generated from training movies correspond to the automatic training subset with noisy action labels. Based on this subset we also constructed a clean training subset with action labels manually verified to be correct. We also provide a test subset with manually checked action labels.
Scene classes are selected automatically from scripts such as to maximize co-occurrence with the given action classes and to capture action context as described in Marszałek et al. CVPR'09. Scene video samples are then generated using script-to-video alignment. The labels of test scene samples are manually verified to be correct.
The following tables provide the numbers of video samples in each of the subsets as well as the distributions of class instances in each subset. Note that samples may contain instances of several actions such e.g. kissing and hugging.