Abstract. We develop a robust multi-scale structure-aware neural network for human pose estimation. This method extends recent deep
conv-deconv hourglass models with four key improvements: (1) multi-scale supervision to strengthen contextual feature learning in matching
body keypoints by combining feature heatmaps across scales, (2) a multi-scale regression network at the end to globally optimize the structural
matching of the multi-scale features, (3) a structure-aware loss used in the
intermediate supervision and at the regression to improve the matching
of keypoints and their respective neighbors to infer higher-order matching
configurations, and (4) a keypoint masking training scheme that can effectively fine-tune our network to robustly localize occluded keypoints
via adjacent matches. Our method can effectively improve state-of-the-art pose estimation methods that suffer from difficulties in scale variations,
occlusions, and complex multi-person scenarios. The multi-scale supervision integrates tightly with the regression network to effectively (i)
localize keypoints using the ensemble of multi-scale features, and (ii)
infer global pose configuration by maximizing structural consistencies
across multiple keypoints and scales. The keypoint masking training enhances these advantages by focusing learning on hard occlusion samples. Our
method achieves the leading position on the MPII challenge leaderboard
among state-of-the-art methods.
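
To make the interplay of multi-scale supervision and the structure-aware loss more concrete, the following is a minimal sketch and not the exact formulation used in this work. It assumes heatmaps of shape (K, H, W), a squared-error heatmap loss, a hypothetical four-joint chain skeleton (`NEIGHBORS`), an assumed weight `alpha`, and average pooling to produce lower-resolution targets; all function names are illustrative.

```python
import numpy as np

# Hypothetical skeleton: keypoint index -> indices of adjacent keypoints.
# This is a toy 4-joint chain, not the MPII skeleton.
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}


def downsample(heatmaps, factor):
    """Average-pool heatmaps of shape (K, H, W) by an integer factor."""
    k, h, w = heatmaps.shape
    pooled = heatmaps.reshape(k, h // factor, factor, w // factor, factor)
    return pooled.mean(axis=(2, 4))


def structure_aware_loss(pred, target, neighbors=NEIGHBORS, alpha=0.5):
    """Per-keypoint squared error plus a term over each keypoint's neighbor group.

    pred and target have shape (K, H, W); alpha is an assumed weight.
    """
    per_joint = ((pred - target) ** 2).sum(axis=(1, 2))  # shape (K,)
    individual = per_joint.sum()
    # Group term: error summed over each keypoint together with its neighbors,
    # so a mislocalized joint is also penalized through its adjacent joints.
    structural = sum(per_joint[[i] + nbrs].sum() for i, nbrs in neighbors.items())
    return individual + alpha * structural


def multiscale_loss(preds_by_scale, target, factors):
    """Sum the structure-aware loss over predictions at several resolutions."""
    return sum(
        structure_aware_loss(pred, downsample(target, factor))
        for pred, factor in zip(preds_by_scale, factors)
    )
```

In this sketch, an error on a single keypoint is counted once in the individual term and once more for every neighbor group that contains it, which is one simple way to encourage structurally consistent predictions across adjacent keypoints.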