Abstract
The multi-shot pedestrian re-identification problem is at the
core of surveillance video analysis. It matches two tracks of
pedestrians from different cameras. In contrast to existing
works that aggregate single-frame features with a time series
model such as a recurrent neural network, in this paper we
propose an interpretable reinforcement-learning-based approach to this problem. In particular, we train an agent to
verify one pair of images at each time step. The agent can choose
to output the result (same or different) or request another
pair of images to verify (unsure). In this way, our model
implicitly learns the difficulty of image pairs and postpones
the decision when it has not yet accumulated enough
evidence. Moreover, by adjusting the reward for the unsure
action, we can easily trade off between speed and accuracy.
On three open benchmarks, our method is competitive with
state-of-the-art methods while using only 3% to 6% of the images. These promising results demonstrate that our method
is favorable in both efficiency and performance.
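
The sequential decision process summarized above can be illustrated with a minimal sketch. It is not the trained agent: the learned policy is replaced by a hypothetical threshold rule on accumulated evidence, and the per-pair scores, the `margin`, the `max_pairs` cap, and all reward values are assumptions chosen only to show the three actions (same, different, unsure) and how the reward for unsure enters the episode return.

```python
from typing import Callable, Iterable, Tuple

SAME, DIFFERENT, UNSURE = "same", "different", "unsure"


def threshold_policy(evidence: float, margin: float = 1.0) -> str:
    """Hypothetical stand-in for the learned agent: commit to a decision
    only once the accumulated evidence leaves the band (-margin, margin)."""
    if evidence >= margin:
        return SAME
    if evidence <= -margin:
        return DIFFERENT
    return UNSURE


def verify_tracks(
    pair_scores: Iterable[float],
    policy: Callable[[float], str] = threshold_policy,
    max_pairs: int = 8,
) -> Tuple[str, int]:
    """Look at one image-pair score at a time (positive means "looks like
    the same person") and stop as soon as the policy outputs a decision.
    Returns the decision and the number of pairs that were inspected."""
    evidence = 0.0
    used = 0
    action = UNSURE
    for score in pair_scores:
        used += 1
        evidence += score
        action = policy(evidence)
        if action != UNSURE or used >= max_pairs:
            break
    if action == UNSURE:
        # Budget exhausted (or no pairs given): force a decision from the sign.
        action = SAME if evidence >= 0 else DIFFERENT
    return action, used


def episode_return(decision: str, used: int, truth: str,
                   unsure_reward: float = -0.2) -> float:
    """Assumed reward scheme: +1 for a correct final decision, -1 for a
    wrong one, plus unsure_reward for every pair requested before it."""
    final = 1.0 if decision == truth else -1.0
    return unsure_reward * (used - 1) + final


if __name__ == "__main__":
    decision, used = verify_tracks([0.4, 0.3, 0.5])
    print(decision, used, episode_return(decision, used, truth=SAME))
```

Under this assumed scheme, a more negative `unsure_reward` penalizes every additional pair and so favors early decisions, while a value closer to zero lets the agent gather more evidence before committing.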