Abstract
In this paper, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover
high-resolution representations from low-resolution representations produced by a high-to-low resolution network.
Instead, our proposed network maintains high-resolution
representations through the whole process.
We start from a high-resolution subnetwork as the first
stage, gradually add high-to-low resolution subnetworks
one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated
multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially
more precise. We empirically demonstrate the effectiveness
of our network through the superior pose estimation results
over two benchmark datasets: the COCO keypoint detection
dataset and the MPII Human Pose dataset. In addition, we
show the superiority of our network in pose tracking on the
PoseTrack dataset. The code and models are publicly
available at https://github.com/leoxiaobin/deep-high-resolution-net.pytorch.
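
To make the idea of repeated multi-scale fusion concrete, the following is a minimal toy sketch of information exchange between two parallel branches at different resolutions. It is not the released implementation. The single-channel nested-list feature maps, the 2x2 average pooling used for downsampling, the nearest-neighbor upsampling, the element-wise summation, and all function names (`downsample2x`, `upsample2x`, `fuse`, `repeated_fusion`) are assumptions chosen for illustration; a real network would use learned convolutions in each branch and in the resampling paths.

```python
# Toy sketch of multi-scale fusion between two parallel branches.
# Assumptions (not taken from the abstract): single-channel maps stored as
# nested lists, 2x2 average pooling for downsampling, nearest-neighbor
# upsampling, and element-wise summation as the fusion operator.

def downsample2x(x):
    """Halve the resolution by averaging each non-overlapping 2x2 block."""
    h, w = len(x), len(x[0])
    return [
        [(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]

def upsample2x(x):
    """Double the resolution by repeating each value in a 2x2 block."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def add(a, b):
    """Element-wise sum of two maps with the same shape."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def fuse(high, low):
    """One fusion step: each branch receives the other's resampled map."""
    new_high = add(high, upsample2x(low))
    new_low = add(low, downsample2x(high))
    return new_high, new_low

def repeated_fusion(high, low, steps):
    """Apply the fusion step several times; both resolutions are kept."""
    for _ in range(steps):
        high, low = fuse(high, low)
    return high, low

# Hypothetical inputs: a 4x4 high-resolution map and a 2x2 low-resolution map.
high = [[float(4 * i + j) for j in range(4)] for i in range(4)]
low = [[1.0, 2.0], [3.0, 4.0]]
new_high, new_low = repeated_fusion(high, low, steps=2)
```

In this sketch the high-resolution map is never discarded or reconstructed from a coarser one: it is kept at full size throughout and only enriched by the upsampled low-resolution branch, which is the property the abstract emphasizes.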