Improving Deep Visual Representation for Person Re-identification by Global and Local Image-language Association
Abstract. Person re-identification is an important task that requires
learning discriminative visual features for distinguishing different person
identities. Diverse auxiliary information has been utilized to improve
visual feature learning. In this paper, we propose to exploit natural language descriptions as additional training supervision for learning effective
visual features. Compared with other auxiliary information, language can
describe a specific person from more compact and semantic visual aspects,
and is thus complementary to the pixel-level image data. Our method not
only learns better global visual features with the supervision of the overall
description but also enforces semantic consistencies between local visual
and linguistic features, which is achieved by building global and local
image-language associations. The global image-language association is
established according to the identity labels, while the local association is
based upon the implicit correspondences between image regions and noun
phrases. Extensive experiments demonstrate the effectiveness of employing language as training supervision with the two association schemes.
Our method achieves state-of-the-art performance without utilizing any
auxiliary information during testing and shows better performance than
other joint embedding methods for image-language association.
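To make the global association scheme concrete, below is a minimal PyTorch-style sketch of the idea stated above: image and sentence features are projected into a shared space and supervised by the same identity classifier, so the language description serves purely as training-time supervision and is not needed at test time. All module names, feature dimensions, and the use of a single shared linear identity classifier are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of global image-language association via a shared
# identity classifier; dimensions and module choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAssociation(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=1024, embed_dim=512, num_ids=751):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)  # project CNN image feature
        self.txt_proj = nn.Linear(txt_dim, embed_dim)  # project sentence feature
        # Shared classifier ties the two modalities to the same identity space.
        self.id_classifier = nn.Linear(embed_dim, num_ids)

    def forward(self, img_feat, txt_feat, labels):
        img_emb = F.normalize(self.img_proj(img_feat), dim=1)
        txt_emb = F.normalize(self.txt_proj(txt_feat), dim=1)
        # Both modalities must predict the same person identity, so the
        # description supervises the visual representation during training.
        loss_img = F.cross_entropy(self.id_classifier(img_emb), labels)
        loss_txt = F.cross_entropy(self.id_classifier(txt_emb), labels)
        return loss_img + loss_txt

# Usage with random tensors standing in for real features:
model = GlobalAssociation()
img_feat = torch.randn(8, 2048)       # e.g., pooled CNN features (assumed)
txt_feat = torch.randn(8, 1024)       # e.g., RNN sentence encodings (assumed)
labels = torch.randint(0, 751, (8,))  # identity labels
loss = model(img_feat, txt_feat, labels)
```

At inference, only the image branch would be kept, which matches the claim that no auxiliary information is used during testing.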