Abstract
We present an efficient method for semi-supervised
video object segmentation. Our method achieves accuracy
competitive with state-of-the-art methods while running in a
fraction of the time they require. To this end, we propose
a deep Siamese encoder-decoder network that is designed
to take advantage of mask propagation and object detection while avoiding the weaknesses of both approaches. Our
network, learned through a two-stage training process that
exploits both synthetic and real data, works robustly without any online learning or post-processing. We validate our
method on four benchmark sets that cover single- and multiple-object segmentation. On all of these benchmark sets, our
method achieves accuracy comparable to the state of the art while running an order of magnitude faster. We also provide extensive
ablation and add-on studies to analyze and evaluate our
framework.