Abstract
We propose a method for multi-person detection and 2-D
pose estimation that achieves state-of-the-art results on the
challenging COCO keypoints task. It is a simple, yet powerful, top-down approach consisting of two stages.
In the first stage, we predict the location and scale of
boxes which are likely to contain people; for this we use
the Faster RCNN detector. In the second stage, we estimate
the keypoints of the person potentially contained in each
proposed bounding box. For each keypoint type we predict dense heatmaps and offsets using a fully convolutional
ResNet. To combine these outputs we introduce a novel aggregation procedure to obtain highly localized keypoint predictions. We also use a novel form of keypoint-based Non-Maximum Suppression (NMS), instead of the cruder box-level NMS, and a novel form of keypoint-based confidence
score estimation, instead of box-level scoring.
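The abstract does not spell out how the heatmaps and offsets are combined. As a rough illustration only, the sketch below assumes a simple voting scheme: every pixel casts a vote, weighted by its heatmap probability, at the location its offset points to, and the keypoint is read off as the peak of the accumulated score map. The function name `aggregate_heatmap_offsets`, the nearest-pixel rounding, and the synthetic inputs are assumptions made for this example and are not taken from the method itself.

```python
import numpy as np


def aggregate_heatmap_offsets(prob, offsets):
    """Fuse a keypoint probability map with per-pixel 2-D offsets by voting.

    prob:    (H, W) array, probability that each pixel lies near the keypoint.
    offsets: (H, W, 2) array, predicted (dy, dx) from each pixel to the keypoint.
    Returns an (H, W) score map whose argmax is the keypoint estimate.
    """
    h, w = prob.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Location each pixel votes for (assumption: round to the nearest pixel).
    ty = np.rint(ys + offsets[..., 0]).astype(int)
    tx = np.rint(xs + offsets[..., 1]).astype(int)
    # Discard votes that fall outside the image.
    valid = (ty >= 0) & (ty < h) & (tx >= 0) & (tx < w)
    score = np.zeros((h, w), dtype=float)
    # np.add.at accumulates correctly when several pixels vote for one target.
    np.add.at(score, (ty[valid], tx[valid]), prob[valid])
    return score


# Synthetic example: a disk-shaped heatmap around a known keypoint,
# with offsets that point exactly at it.
h, w = 9, 9
true_y, true_x = 4, 6
ys, xs = np.mgrid[0:h, 0:w]
prob = (np.hypot(ys - true_y, xs - true_x) <= 2).astype(float)
offsets = np.stack([true_y - ys, true_x - xs], axis=-1).astype(float)

score = aggregate_heatmap_offsets(prob, offsets)
peak = np.unravel_index(np.argmax(score), score.shape)
```

In this toy setting all votes from the disk land on one pixel, so the score map has a single sharp peak at the true keypoint, whereas the raw heatmap is flat across the whole disk. That is the sense in which combining the two outputs can give a more localized prediction than either alone.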
Trained on COCO data alone, our final system achieves
average precision of 0.649 on the COCO test-dev set and
0.643 on the test-standard set, outperforming the winner of
the 2016 COCO keypoints challenge and other recent state-of-the-art methods. Further, by using additional in-house labeled data
we obtain an even higher average precision of 0.685 on the
test-dev set and 0.673 on the test-standard set, more than
5% absolute improvement compared to the previous best-performing
method on the same dataset.