Abstract
Blind video decaptioning is the problem of automatically removing text overlays and inpainting the occluded regions in videos without any input masks. While recent deep-learning-based inpainting methods deal with a single image and mostly assume that the positions of the corrupted pixels are known, we aim at automatic text removal in video sequences without mask information. In this paper, we propose a simple yet effective framework for fast blind video decaptioning. We construct an encoder-decoder model, where the encoder takes multiple source frames that can provide visible pixels revealed by the scene dynamics. These hints are aggregated and fed into the decoder. We apply a residual connection from the input frame to the decoder output so that the network focuses on the corrupted regions only.
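To make the aggregation-and-residual idea concrete, the following is a minimal PyTorch sketch of this design. The module names, layer counts, and hyperparameters (e.g., `hidden`, `num_frames`) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BlindDecaptionNet(nn.Module):
    """Encoder-decoder with a residual connection from the center frame.

    A minimal sketch under assumed layer choices; the actual
    architecture in the paper may differ.
    """

    def __init__(self, num_frames=5, hidden=64):
        super().__init__()
        # The encoder sees all source frames stacked along the channel
        # axis, so visible pixels from neighboring frames are aggregated.
        self.encoder = nn.Sequential(
            nn.Conv2d(3 * num_frames, hidden, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden * 2, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # The decoder upsamples the aggregated features back to image size.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden * 2, hidden, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(hidden, 3, 4, stride=2, padding=1),
        )

    def forward(self, frames):
        # frames: (B, T, 3, H, W); the center frame is the one to restore.
        b, t, c, h, w = frames.shape
        center = frames[:, t // 2]
        feats = self.encoder(frames.reshape(b, t * c, h, w))
        residual = self.decoder(feats)
        # Residual connection: the network predicts only a correction,
        # so uncorrupted regions of the input pass through unchanged.
        return center + residual
```

The residual connection means the network's default output is the input frame itself, which biases learning toward the corrupted text regions.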
Our proposed model ranked first in the ECCV ChaLearn 2018 LAP Inpainting Competition Track 2: Video Decaptioning. In addition, we further improve this strong model by applying recurrent feedback. The recurrent feedback not only enforces temporal coherence but also provides strong clues about where the corrupted pixels are. Both qualitative and quantitative experiments demonstrate that our full model produces accurate and temporally consistent video results in real time (50+ fps).
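A hedged sketch of the recurrent feedback loop follows: the previously restored frame is fed back as an extra input, which both encourages temporal coherence and hints at corrupted locations (pixels where the feedback and the current frame disagree). The model interface and bootstrapping scheme here are our assumptions, not the paper's exact procedure.

```python
import torch

def decaption_video(model, frames):
    """Run a decaptioning model over a clip with recurrent feedback.

    Assumes `model` accepts the current corrupted frame plus the
    previously restored frame; this interface is illustrative only.

    frames: (T, 3, H, W) tensor of corrupted frames.
    """
    outputs = []
    prev = frames[0]  # bootstrap the feedback with the first raw frame
    for t in range(frames.shape[0]):
        # Feeding the previous output back in keeps consecutive outputs
        # temporally coherent, and its disagreement with the current
        # frame hints at where the overlay text sits.
        out = model(frames[t], prev)
        outputs.append(out)
        prev = out.detach()  # do not backpropagate across time steps
    return torch.stack(outputs)
```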