Abstract
Siamese-network-based trackers formulate tracking as
convolutional feature cross-correlation between a target
template and a search region. However, Siamese trackers still have an accuracy gap compared with state-of-the-art algorithms, and they cannot take advantage of features
from deep networks, such as ResNet-50 or deeper. In this
work, we prove that the core reason is the lack of strict
translation invariance. Through comprehensive theoretical analysis and experimental validation, we break this restriction
with a simple yet effective spatial-aware sampling strategy and successfully train a ResNet-driven Siamese tracker
with significant performance gain. Moreover, we propose
a new model architecture to perform layer-wise and depth-wise aggregations, which not only further improves the accuracy but also reduces the model size. We conduct extensive ablation studies to demonstrate the effectiveness of the
proposed tracker, which currently obtains the best results
on five large tracking benchmarks, including OTB2015,
VOT2018, UAV123, LaSOT, and TrackingNet.
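
As a rough illustration of the cross-correlation formulation mentioned above, the following NumPy sketch slides a template feature map over a search-region feature map one channel at a time (a depth-wise correlation) and takes the peak of the summed response as the predicted target offset. It is a minimal sketch under stated assumptions, not the tracker's implementation: the function names `depthwise_xcorr` and `locate` are hypothetical, the inputs stand in for learned CNN features, and summing the channels is only one simple way to reduce the response to a single score map.

```python
import numpy as np


def depthwise_xcorr(search, template):
    """Correlate each channel of `search` with the matching channel of `template`.

    search:   array of shape (C, Hs, Ws), features of the search region
    template: array of shape (C, Ht, Wt), features of the target template
    returns:  array of shape (C, Hs - Ht + 1, Ws - Wt + 1)
    """
    c, hs, ws = search.shape
    ct, ht, wt = template.shape
    if c != ct:
        raise ValueError("search and template must have the same channel count")
    if ht > hs or wt > ws:
        raise ValueError("template must not be larger than the search region")

    out = np.empty((c, hs - ht + 1, ws - wt + 1), dtype=np.float64)
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            # Window of the search features aligned with the template at (i, j).
            window = search[:, i:i + ht, j:j + wt]
            # Per-channel inner product; channels are not mixed.
            out[:, i, j] = (window * template).sum(axis=(1, 2))
    return out


def locate(response):
    """Return the (row, col) offset with the highest channel-summed response."""
    score = response.sum(axis=0)
    row, col = np.unravel_index(np.argmax(score), score.shape)
    return int(row), int(col)


# Toy example: embed a strictly positive template in an all-zero search map.
template = np.arange(1, 2 * 3 * 3 + 1, dtype=np.float64).reshape(2, 3, 3)
search = np.zeros((2, 8, 10), dtype=np.float64)
search[:, 3:6, 5:8] = template

response = depthwise_xcorr(search, template)
peak = locate(response)
```

In this toy setup the response peaks at the offset where the template was embedded. With real features the raw correlation is unnormalized, so the peak is only a score to be interpreted by later stages of a tracker, not a guaranteed match.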