Abstract
The tracking-by-detection framework consists of two
stages, i.e., drawing samples around the target object in the
first stage and classifying each sample as the target object
or as background in the second stage. The performance of
existing trackers using deep classification networks is limited by two aspects. First, the positive samples in each
frame are highly spatially overlapped, and they fail to capture rich appearance variations. Second, there exists extreme class imbalance between positive and negative samples. This paper presents the VITAL algorithm to address
these two problems via adversarial learning. To augment
positive samples, we use a generative network to randomly
generate masks, which are applied to adaptively dropout
input features to capture a variety of appearance changes.
With the use of adversarial learning, our network identi-
fies the mask that maintains the most robust features of the
target objects over a long temporal span. In addition, to
handle the issue of class imbalance, we propose a highorder cost sensitive loss to decrease the effect of easy negative samples to facilitate training the classification network.
Extensive experiments on benchmark datasets demonstrate
that the proposed tracker performs favorably against stateof-the-art approaches