Abstract. We address the problem of video object segmentation which
outputs the masks of a target object throughout a video given only
a bounding box in the first frame. There are two main challenges to
this task. First, the background may contain objects similar to the target. Second, the appearance of the target object may change drastically
over time. To tackle these challenges, we propose an end-to-end trainable network that predicts the foreground by leveraging location-sensitive embeddings capable of distinguishing the pixels of similar objects. To deal with appearance changes, we propose a robust model adaptation method that pre-scans the whole test video, generates pseudo foreground/background labels, and retrains the
model on these labels. Our method outperforms state-of-the-art
methods on the DAVIS and SegTrack v2 datasets.