Abstract
In this paper we present a tracker that is radically different from state-of-the-art trackers: we apply no model updating, no occlusion detection, no combination of trackers, no geometric matching, and still deliver state-of-the-art tracking performance, as demonstrated on the popular online tracking benchmark (OTB) and six very challenging YouTube videos. The presented tracker simply matches the initial patch of the target in the first frame with candidates in a new frame and returns the most similar patch by a learned matching function. The strength of the matching function comes from being extensively trained generically, i.e., without any data of the target, using a Siamese deep neural network, which we design for tracking. Once learned, the matching function is used as is, without any adapting, to track previously unseen targets. It turns out that the learned matching function is so powerful that a simple tracker built upon it, coined Siamese INstance search Tracker (SINT), which only uses the original observation of the target from the first frame, suffices to reach state-of-the-art performance. Further, we show the proposed tracker even allows for target re-identification after the target was absent for a complete video shot.
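The core idea of the abstract, matching the fixed first-frame patch against candidates with a learned, never-updated similarity function, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `embed` function here is a hypothetical stand-in (a fixed random projection) for the trained Siamese network, and the names `embed` and `match` are chosen for exposition.

```python
import numpy as np

def embed(patch, proj):
    # Hypothetical stand-in for the learned Siamese embedding:
    # a fixed linear projection of the flattened patch, L2-normalized.
    # In SINT this would be the output of the trained deep network,
    # shared between the query branch and the candidate branch.
    v = proj @ patch.ravel()
    return v / np.linalg.norm(v)

def match(init_patch, candidates, proj):
    """Return the index of the candidate most similar to the initial patch.

    Similarity is the dot product of L2-normalized embeddings (cosine
    similarity). The matching function is fixed during tracking,
    mirroring the no-model-update design described in the abstract.
    """
    q = embed(init_patch, proj)
    scores = [float(q @ embed(c, proj)) for c in candidates]
    return int(np.argmax(scores)), scores

# Toy demo: the candidate identical to the initial patch should win.
rng = np.random.default_rng(0)
proj = rng.standard_normal((32, 8 * 8))   # one projection for both branches
target = rng.standard_normal((8, 8))
cands = [rng.standard_normal((8, 8)), target.copy(), rng.standard_normal((8, 8))]
best, scores = match(target, cands, proj)
print(best)  # 1
```

Because the embedding is shared between the initial patch and every candidate (the "Siamese" property), tracking reduces to a nearest-neighbor search in embedding space, which is also what enables re-identification after the target has been absent.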