Abstract
In this paper, we present LaSOT, a high-quality benchmark for Large-scale Single Object Tracking. LaSOT consists of 1,400 sequences with more than 3.5M frames in total. Each frame in these sequences is carefully and manually annotated with a bounding box, making LaSOT the
largest, to the best of our knowledge, densely annotated
tracking benchmark. The average video length of LaSOT
is more than 2,500 frames, and each sequence comprises
various challenges arising in the wild, where target objects may disappear from and re-appear in the view. By releasing LaSOT, we expect to provide the community with a
large-scale, dedicated, high-quality benchmark for both
the training of deep trackers and the faithful evaluation of
tracking algorithms. Moreover, considering the close connection between visual appearance and natural language, we enrich LaSOT by providing additional language specifications,
aiming to encourage the exploration of natural language
features for tracking. A thorough experimental evaluation of
35 tracking algorithms on LaSOT is presented with detailed
analysis, and the results demonstrate that there is still
substantial room for improvement.