Abstract. Despite many advances in deep-learning-based semantic segmentation, a performance drop due to distribution mismatch is often encountered in the real world. Recently, a few domain adaptation and active
learning approaches have been proposed to mitigate this performance
drop. However, very little attention has been paid to leveraging
information in videos, which are naturally captured by most camera systems. In this work, we propose to leverage a “motion prior” in videos for
improving human segmentation in a weakly-supervised active learning
setting. By computing optical flow in videos,
we extract candidate foreground motion segments (referred to as
the motion prior) that potentially correspond to human segments. We propose to learn a memory-network-based policy model to select strong
candidate segments (referred to as strong motion prior) through reinforcement learning. The selected segments have high precision and are
directly used to finetune the model. On a newly collected surveillance
camera dataset and the publicly available UrbanStreet dataset, our proposed method improves human segmentation performance across
multiple scenes and modalities (i.e., RGB to infrared (IR)). Last but not
least, our method is empirically complementary to existing domain adaptation approaches: additional performance gains are achieved by
combining our weakly-supervised active learning approach with domain
adaptation approaches.