Abstract
Person Re-Identification (ReID) requires comparing two
images of a person captured under different conditions. Existing work based on neural networks often computes the similarity of feature maps from a single convolutional layer.
In this work, we propose an efficient, end-to-end fully convolutional Siamese network that computes the similarities
at multiple levels. We demonstrate that multi-level similarity can considerably improve accuracy in the ReID problem using low-complexity network structures. Specifically,
first, we use several convolutional layers to extract the
features of two input images. Then, we propose a Convolution
Similarity Network to compute the similarity score maps for
the inputs. We use spatial transformer networks (STNs) to
determine spatial attention. We propose to apply efficient
depth-wise convolution to compute the similarity. The proposed Convolution Similarity Networks can be inserted into
different convolutional layers to extract visual similarities
at different levels. We also use an improved ranking loss to further improve performance. Our work is the
first to compute visual similarities at low, middle
and high levels for ReID. With extensive experiments and
analysis, we demonstrate that our system, compact yet effective, can achieve competitive results with a much smaller
model size and lower computational complexity.
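
As an illustration of computing similarity score maps with depth-wise convolution, the following is a minimal sketch, not the network described above. It assumes that a patch taken from one image's feature map serves as the kernel and is slid over the other image's feature map, one channel at a time with no mixing across channels. The function name `depthwise_similarity` and the toy inputs are hypothetical; the STN-based attention and the ranking loss are not modelled.

```python
def depthwise_similarity(feat, kernel):
    """Depth-wise cross-correlation (stride 1, no padding).

    feat:   C x H x W nested lists (feature map of one image)
    kernel: C x kh x kw nested lists (assumed: a patch of the
            other image's feature map, used as the filter)
    Returns C score maps of size (H-kh+1) x (W-kw+1); channel c of
    the output depends only on channel c of the two inputs.
    """
    if len(feat) != len(kernel):
        raise ValueError("channel counts must match")
    out = []
    for f_c, k_c in zip(feat, kernel):
        H, W = len(f_c), len(f_c[0])
        kh, kw = len(k_c), len(k_c[0])
        score = [
            [
                sum(f_c[i + u][j + v] * k_c[u][v]
                    for u in range(kh) for v in range(kw))
                for j in range(W - kw + 1)
            ]
            for i in range(H - kh + 1)
        ]
        out.append(score)
    return out


# Toy example: 2 channels, 3x3 feature map, 2x2 kernel per channel.
feat = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
]
kernel = [
    [[1, 0], [0, 1]],
    [[1, 0], [0, 1]],
]
scores = depthwise_similarity(feat, kernel)
```

Because each channel is filtered independently, the cost is one small correlation per channel instead of a full C x C convolution, which is in line with the low-complexity goal stated above.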