Abstract
Video object segmentation aims to segment a specific
object throughout a video sequence, given only an annotated first frame. Recent deep-learning-based approaches
find it effective to fine-tune a general-purpose segmentation model on the annotated frame using hundreds of iterations of gradient descent. Despite the high accuracy
that these methods achieve, the fine-tuning process is inefficient and fails to meet the requirements of real-world
applications. We propose a novel approach that uses a
single forward pass to adapt the segmentation model to
the appearance of a specific object. Specifically, a second
meta neural network, named the modulator, is trained to manipulate the intermediate layers of the segmentation network
given limited visual and spatial information of the target
object. The experiments show that our approach is 70×
faster than fine-tuning approaches and achieves similar accuracy. Our model and code have been released at
https://github.com/linjieyangsc/video_seg
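
To make the idea of manipulating intermediate layers concrete, the following is a minimal sketch rather than the released implementation. It assumes the modulation takes the form of a per-channel scale (derived from visual information) and a per-pixel bias (derived from spatial information); the abstract does not specify this form, and the names `modulate`, `channel_scale`, and `spatial_bias` are hypothetical. The parameters are hard-coded here, whereas in the proposed approach they would be produced by the modulator in a single forward pass.

```python
def modulate(features, channel_scale, spatial_bias):
    """Apply a per-channel scale and a per-pixel bias to a C x H x W feature map.

    Assumed form of modulation (not taken from the abstract):
        out[c][y][x] = channel_scale[c] * features[c][y][x] + spatial_bias[y][x]
    """
    return [
        [
            [scale * value + bias for value, bias in zip(row, bias_row)]
            for row, bias_row in zip(channel, spatial_bias)
        ]
        for channel, scale in zip(features, channel_scale)
    ]


# Toy 2-channel, 2x2 feature map standing in for an intermediate layer.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]

# Hypothetical modulation parameters; a modulator network would predict these
# from the annotated first frame instead of running gradient descent.
channel_scale = [2.0, 0.0]               # amplify channel 0, suppress channel 1
spatial_bias = [[0.0, 1.0], [0.0, 1.0]]  # favor the right-hand column

modulated = modulate(features, channel_scale, spatial_bias)
```

Because the segmentation network's own weights stay fixed and only these lightweight parameters change per object, adapting to a new target under this assumption costs one forward pass of the modulator instead of hundreds of fine-tuning iterations.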