Abstract
This paper extends the popular task of multi-object tracking to multi-object tracking and segmentation (MOTS). Towards this goal, we create dense pixel-level annotations for two existing tracking datasets using a semi-automatic annotation procedure. Our new annotations comprise 65,213 pixel masks for 977 distinct objects (cars and pedestrians) in 10,870 video frames. For evaluation, we extend existing multi-object tracking metrics to this new task. Moreover, we propose a new baseline method which jointly addresses detection, tracking, and segmentation with a single convolutional network. We demonstrate the value of our datasets by achieving improvements in performance when training on MOTS annotations. We believe that our datasets, metrics, and baseline will become a valuable resource towards developing multi-object tracking approaches that go beyond 2D bounding boxes. We make our annotations, code, and models available at https://www.vision.rwth-aachen.de/page/mots.