Abstract
We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. By analogy with auto-encoders, we propose Context Encoders – a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. In order to succeed at this task, context encoders need to both understand the content of the entire image and produce a plausible hypothesis for the missing part(s). When training context encoders, we have experimented with both a standard pixel-wise reconstruction loss and a reconstruction plus an adversarial loss. The latter produces much sharper results because it can better handle multiple modes in the output. We found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures. We quantitatively demonstrate the effectiveness of our learned features for CNN pre-training on classification, detection, and segmentation tasks. Furthermore, context encoders can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
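To make the joint objective concrete, the following PyTorch-style sketch combines a pixel-wise L2 reconstruction term with an adversarial term for the predicted region. This is an illustrative sketch, not the authors' implementation; the names `generator` and `discriminator` and the loss weights are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def joint_loss(generator, discriminator, masked_img, target_region,
               lambda_rec=0.999, lambda_adv=0.001):
    """Sketch of a reconstruction + adversarial generator loss.

    Assumes `generator` maps a masked image to a prediction for the
    missing region, and `discriminator` outputs a probability (after
    a sigmoid) that its input region is real. The weights are
    placeholder values, not prescribed here.
    """
    # Predict the missing region from the surrounding context.
    pred_region = generator(masked_img)

    # Pixel-wise L2 reconstruction loss against the ground-truth region.
    rec = F.mse_loss(pred_region, target_region)

    # Adversarial term: the generator is rewarded when the
    # discriminator labels its prediction as real (label 1).
    d_out = discriminator(pred_region)
    adv = F.binary_cross_entropy(d_out, torch.ones_like(d_out))

    return lambda_rec * rec + lambda_adv * adv
```

Weighting the reconstruction term heavily keeps predictions anchored to the ground truth, while the small adversarial term pushes the output toward a single sharp mode rather than the blurry average that a pure L2 loss tends to produce.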