Abstract. Comparing the appearance of corresponding body parts is essential
for person re-identification. As body parts are frequently misaligned between the
detected human boxes, an image representation that can handle this misalignment is required. In this paper, we propose a network that learns a part-aligned
representation for person re-identification. Our model consists of a two-stream
network, which generates appearance and body part feature maps respectively,
and a bilinear-pooling layer that fuses the two feature maps into a single image descriptor.
We show that this fusion yields a compact descriptor whose matching similarity is equivalent to an aggregation of the local appearance similarities of the
corresponding body parts. Since this similarity does not depend on the relative positions of parts, our approach significantly alleviates the part-misalignment
problem. Training the network does not require any part annotation on the person
re-identification dataset. Instead, we simply initialize the part sub-stream using
a pre-trained sub-network of an existing pose estimation network and train the
whole network to minimize the re-identification loss. We validate the effectiveness of our approach by demonstrating its superiority over the state-of-the-art
methods on standard benchmark datasets, including Market-1501, CUHK03,
CUHK01, and DukeMTMC, and on the standard video dataset MARS.
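The claimed equivalence between the descriptor inner product and an aggregation of part-weighted local similarities can be verified numerically. The sketch below is a minimal illustration, not the paper's implementation: it assumes each image is reduced to per-location appearance vectors and part-map vectors (toy random dimensions here), pools them by summing outer products, and checks that the inner product of the pooled descriptors equals the sum, over all location pairs, of appearance similarity weighted by part-map similarity.

```python
import numpy as np

# Toy, hypothetical dimensions (not from the paper):
# L spatial locations, d_a appearance channels, d_p part channels.
rng = np.random.default_rng(0)
L, d_a, d_p = 6, 4, 3

def bilinear_descriptor(A, P):
    # Bilinear pooling: sum over locations x of the outer product
    # a(x) p(x)^T, flattened into a single compact vector.
    return sum(np.outer(A[x], P[x]) for x in range(len(A))).ravel()

# Two images, each with appearance maps A* and part maps P*.
A1, P1 = rng.standard_normal((L, d_a)), rng.standard_normal((L, d_p))
A2, P2 = rng.standard_normal((L, d_a)), rng.standard_normal((L, d_p))

f1 = bilinear_descriptor(A1, P1)
f2 = bilinear_descriptor(A2, P2)

# Descriptor inner product ...
lhs = f1 @ f2
# ... equals the aggregation of local appearance similarities
# weighted by how strongly the two locations share a body part.
rhs = sum((A1[x] @ A2[y]) * (P1[x] @ P2[y])
          for x in range(L) for y in range(L))

assert np.isclose(lhs, rhs)
```

The part-map factor `P1[x] @ P2[y]` is near zero when locations x and y belong to different body parts, so only appearance comparisons between corresponding parts contribute, regardless of where those parts sit in each box.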