Abstract
Region proposal networks (RPNs) have recently been
combined with Siamese networks for tracking, showing
excellent accuracy with high efficiency. Nevertheless, previously proposed one-stage Siamese-RPN trackers degenerate
in the presence of similar distractors and large scale variation.
Addressing these issues, we propose a multi-stage tracking
framework, Siamese Cascaded RPN (C-RPN), which consists of a sequence of RPNs cascaded from deep high-level
to shallow low-level layers in a Siamese network. Compared to previous solutions, C-RPN has several advantages:
(1) Each RPN is trained using the outputs of the RPN in the
previous stage. This process stimulates hard negative sampling, resulting in more balanced training samples. Consequently, the RPNs are sequentially more discriminative
in distinguishing difficult background (i.e., similar distractors). (2) Multi-level features are fully leveraged through a
novel feature transfer block (FTB) for each RPN, further improving the discriminability of C-RPN using both high-level
semantic and low-level spatial information. (3) With multiple steps of regression, C-RPN progressively refines the
location and shape of the target in each RPN using the anchor boxes
adjusted in the previous stage, which makes localization more accurate. C-RPN is trained end-to-end with a
multi-task loss function. In inference, C-RPN is deployed as
it is, without any temporal adaptation, for real-time tracking.
In extensive experiments on OTB-2013, OTB-2015, VOT-2016,
VOT-2017, LaSOT and TrackingNet, C-RPN consistently achieves state-of-the-art results and runs in real time.
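
To make the cascade idea in advantages (1) and (3) concrete, the following is a minimal sketch of a single stage that discards easy negative anchors and refines the remaining ones before passing them on. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the threshold value, and the toy scores and offsets are hypothetical, and the box update assumes the standard RPN parameterization of a box as (center x, center y, width, height).

```python
import math


def refine_anchor(anchor, offsets):
    """Shift and rescale one anchor with predicted regression offsets.

    Assumes the common RPN parameterization: center offsets are scaled
    by the anchor size, and size offsets are applied in log space.
    """
    cx, cy, w, h = anchor
    dx, dy, dw, dh = offsets
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


def cascade_stage(anchors, fg_scores, offsets, neg_threshold=0.95):
    """One hypothetical cascade stage.

    Anchors whose background confidence (1 - foreground score) exceeds
    neg_threshold are treated as easy negatives and dropped; the rest
    are refined and handed to the next stage.
    """
    survivors = []
    for anchor, score, off in zip(anchors, fg_scores, offsets):
        if 1.0 - score > neg_threshold:
            continue  # easy negative: filtered out before the next stage
        survivors.append(refine_anchor(anchor, off))
    return survivors


# Toy inputs (made up for illustration): three anchors as (cx, cy, w, h).
anchors = [(50.0, 50.0, 20.0, 40.0),
           (10.0, 10.0, 20.0, 40.0),
           (52.0, 48.0, 30.0, 30.0)]
stage1_scores = [0.80, 0.01, 0.40]
stage1_offsets = [(0.1, 0.0, 0.0, 0.0),
                  (0.0, 0.0, 0.0, 0.0),
                  (0.0, 0.1, 0.0, 0.0)]

survivors = cascade_stage(anchors, stage1_scores, stage1_offsets)
```

In this toy run the second anchor is removed as an easy negative, so a following stage would see only the two harder, already refined anchors. Chaining several such calls, each with its own scores and offsets, mirrors the stage-by-stage filtering and refinement described above.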