Head Pose Estimation

2020-02-05

The VGG Human Pose Estimation datasets are a set of large video datasets annotated with human upper-body pose.

YouTube Pose [1]

The YouTube Pose dataset is a collection of 50 YouTube videos for human upper-body pose estimation. The videos cover a broad range of activities and people, e.g., dancing, stand-up comedy, how-to, sports, disk jockeys, performing arts and sign language signers. One hundred frames from each video have been manually annotated with the 2D locations of the upper-body joints.

BBC Pose [6]

BBC Pose consists of 20 videos (each 0.5 to 1.5 hours long) recorded from the BBC, each with an overlaid sign language interpreter.

Split into train/validation/test

The 20 videos are split into 10 videos for training, 5 for validation and 5 for testing. The dataset contains 9 signers: the training and validation sets share 5 of them, and the testing set contains the other 4. Splitting the data this way maintains enough diversity for training while keeping the evaluation fair, because the test set contains completely different signers from the training and validation sets.
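A signer-disjoint split like the one described above can be sketched as follows. This is only an illustration: the function name `split_by_signer`, the video and signer IDs, and the video-to-signer mapping are all hypothetical, since the text does not say which signer appears in which video. The sketch holds out whole signers, so it does not by itself reproduce the exact 10/5/5 video counts.

```python
import random

def split_by_signer(video_to_signer, n_test_signers, seed=0):
    """Split videos so that no signer appears on both sides of the split."""
    signers = sorted(set(video_to_signer.values()))
    rng = random.Random(seed)
    rng.shuffle(signers)
    # Hold out whole signers, not individual videos
    test_signers = set(signers[:n_test_signers])
    train_val = [v for v, s in video_to_signer.items() if s not in test_signers]
    test = [v for v, s in video_to_signer.items() if s in test_signers]
    return sorted(train_val), sorted(test), test_signers

# Hypothetical mapping: 20 video IDs spread over 9 signer IDs (assumed, not from the dataset)
video_to_signer = {f"video_{i:02d}": f"signer_{i % 9}" for i in range(20)}
train_val_videos, test_videos, test_signers = split_by_signer(
    video_to_signer, n_test_signers=4
)
```

Splitting on signer identity rather than on videos is what prevents a model from being evaluated on a person it has already seen during training.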

Manual ground truth (validation & testing)

From each validation and test video, 200 frames are sampled by clustering the signers' poses (using tracking output from Buehler et al. CVPR'09 - see Sect 2 in [3]) and uniformly sampling frames across clusters, yielding in total 1,000 frames for validation and 1,000 frames for testing. Sampling in this manner ensures that the measured accuracy of joint estimates is not biased towards poses which occur more frequently. These 2,000 sampled frames are manually annotated with upper-body joint locations (head, wrists, elbows and shoulders).
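The effect of sampling uniformly across pose clusters can be illustrated with a minimal sketch. It assumes that a cluster label per frame is already available (the clustering step itself is not reproduced here), and the function name, the round-robin strategy and the toy label counts are illustrative assumptions rather than the authors' procedure.

```python
import random
from collections import defaultdict

def sample_uniform_across_clusters(cluster_labels, n_samples, seed=0):
    """Pick frame indices, drawing as evenly as possible from each pose cluster."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for frame_idx, label in enumerate(cluster_labels):
        by_cluster[label].append(frame_idx)
    # Shuffle each cluster so that popping draws randomly without replacement
    pools = [by_cluster[k] for k in sorted(by_cluster)]
    for pool in pools:
        rng.shuffle(pool)
    chosen = []
    # Round-robin over clusters until enough frames are picked or all pools are empty
    while len(chosen) < n_samples and any(pools):
        for pool in pools:
            if pool and len(chosen) < n_samples:
                chosen.append(pool.pop())
    return sorted(chosen)

# Toy example: one very common pose cluster and two rare ones (made-up counts)
cluster_labels = [0] * 900 + [1] * 80 + [2] * 20
picked = sample_uniform_across_clusters(cluster_labels, n_samples=30)
per_cluster = defaultdict(int)
for idx in picked:
    per_cluster[cluster_labels[idx]] += 1
```

In this toy case each cluster contributes the same number of frames, whereas sampling frames uniformly at random would draw roughly 90% of them from the most common pose.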

Semi-automatic ground truth (training)

In addition to the manual ground truth labels above, all frames of all videos have been assigned joint locations using a semi-automatic but reliable tracker by Buehler et al. CVPR'09. These labels are used as ground truth for training.

Pose visualisation

The above figure shows a scatter plot of stickmen in BBC Pose.

Extended BBC Pose [4]

Extended BBC Pose contains all videos from the BBC Pose dataset plus 72 additional training videos. In total, the dataset contains 92 videos (82 training, 5 validation and 5 testing), i.e. around 7 million frames. The frames of the 72 new videos are automatically assigned joint locations (used as ground truth for training) with the tracker of Charles et al. IJCV'13 [4]. In practice, these 'ground truth' joint locations are slightly noisier than those in the original BBC Pose dataset (which were obtained using the slow, semi-automatic tracker of Buehler et al. CVPR'09).

Short BBC Pose [7]

Short BBC Pose contains five one-hour-long videos of sign language signers, each with a different sleeve length (in contrast to the above datasets, which only contain signers with moderately long sleeves). Each of the five videos has 200 test frames that have been manually annotated with joint locations, amounting to 1,000 test frames in total. Test frames were selected by the authors to contain a diverse range of poses.

ChaLearn Pose

ChaLearn Pose is a subset of the ChaLearn 2013 Multi-modal gesture dataset from Escalera et al. ICMI'13, which contains 23 hours of Kinect data of 27 persons performing 20 Italian gestures. The data includes RGB, depth, foreground segmentations and full body skeletons. In this dataset, both the training and testing labels are noisy (from Kinect).

Dataset statistics

	BBC Pose	Extended BBC Pose	Short BBC Pose	ChaLearn Pose	YouTube Pose
Total videos	20	92	5	955	50
Train videos	10	82 (10 same)	-	393	-
Val videos	5	5 (same)	-	287	-
Test videos	5	5 (same)	5	275	50
People	9	~40	5	27	50
Frames	1.5M	7M	380K	1.3M	-
Train labels	Buehler et al.	Buehler et al. (10) + Charles et al. (72)	-	Kinect	-
Val labels	1,000 manual GT	1,000 manual GT (same)	-	Kinect	-
Test labels	1,000 manual GT	1,000 manual GT (same)	1,000 manual GT	3,200 Kinect	5,000 manual GT