Abstract
While recent years have witnessed astonishing improvements in visual tracking robustness, the advancements in
tracking accuracy have been limited. As the focus has been
directed towards the development of powerful classifiers,
the problem of accurate target state estimation has been
largely overlooked. In fact, most trackers resort to a simple
multi-scale search in order to estimate the target bounding
box. We argue that this approach is fundamentally limited
since target estimation is a complex task, requiring high-level knowledge about the object.
We address this problem by proposing a novel tracking architecture, consisting of dedicated target estimation
and classification components. High-level knowledge is incorporated into the target estimation through extensive offline learning. Our target estimation component is trained
to predict the overlap between the target object and an
estimated bounding box. By carefully integrating target-specific information, our approach achieves previously unseen bounding box accuracy. We further introduce a classification component that is trained online to guarantee
high discriminative power in the presence of distractors.
Our final tracking framework sets a new state-of-the-art
on five challenging benchmarks. On the new large-scale
TrackingNet dataset, our tracker ATOM achieves a relative gain of 15% over the previous best approach, while
running at over 30 FPS. Code and models are available at
https://github.com/visionml/pytracking.