Abstract
Human pose estimation and semantic part segmentation
are two complementary tasks in computer vision. In this
paper, we propose to solve the two tasks jointly for natural
multi-person images, in which the estimated pose provides
object-level shape prior to regularize part segments while
the part-level segments constrain the variation of pose locations. Specifically, we first train two fully convolutional neural networks (FCNs), namely Pose FCN and Part FCN, to
provide initial estimation of pose joint potential and semantic part potential. Then, to refine pose joint location, the two
types of potentials are fused with a fully-connected conditional random field (FCRF), where a novel segment-joint
smoothness term is used to encourage semantic and spatial
consistency between parts and joints. To refine part segments, the refined pose and the original part potential are
integrated through a Part FCN, where the skeleton feature
from pose serves as additional regularization cues for part
segments. Finally, to reduce the complexity of the FCRF, we
induce human detection boxes and infer the graph inside
each box, making the inference forty times faster.
Since there’s no dataset that contains both part segments and pose labels, we extend the PASCAL VOC part
dataset [6] with human pose joints1 and perform extensive
experiments to compare our method against several most
recent strategies. We show that our algorithm surpasses
competing methods by 10.6% in pose estimation with much
faster speed and by 1.5% in semantic part segmentation