Abstract
We study the task of image inpainting, where an
image with a missing region is recovered with plausible content. Recent approaches based on deep
neural networks have shown potential for producing fine detail and are able to take advantage of background information, which provides texture cues for the missing region.
These methods often perform pixel- or patch-level replacement on the deep feature maps of the missing region, so that the generated content
has a texture similar to that of the background region. However, this kind of replacement is a local strategy and
often performs poorly when the background information is misleading. To address this,
we propose to use a multi-scale image contextual
attention learning (MUSICAL) strategy that helps
to flexibly handle richer background information
while avoiding misuse of it. However, such a strategy
alone may not generate content of reasonable style. To address this issue, both a style
loss and a perceptual loss are introduced into the
proposed method to achieve style consistency
in the generated image. Furthermore, we have also
noticed that replacing some of the downsampling
layers in the baseline network with stride-1 dilated convolution layers is beneficial for producing
sharper and more finely detailed results. Experiments on
the Paris Street View, Places, and CelebA datasets
indicate the superior performance of our approach
compared to state-of-the-art methods.
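
As a rough illustration of why a stride-1 dilated convolution can stand in for a downsampling layer, the sketch below compares the output resolution and receptive field of the two options using the standard convolution size arithmetic. The 64-pixel input, the 3x3 kernels, the padding values, and the helper names conv_output_size and receptive_field are assumptions chosen for illustration; they are not taken from the network described here.

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Output length of a convolution along one spatial dimension."""
    # A dilated kernel spans dilation * (kernel - 1) + 1 input positions.
    effective_kernel = dilation * (kernel - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1


def receptive_field(layers):
    """Receptive field of a stack of (kernel, stride, dilation) layers."""
    rf, jump = 1, 1  # jump = spacing of adjacent outputs, in input pixels
    for kernel, stride, dilation in layers:
        rf += dilation * (kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical 64-pixel feature map with 3x3 kernels (illustrative values).
strided = conv_output_size(64, 3, stride=2, padding=1)
dilated = conv_output_size(64, 3, stride=1, padding=2, dilation=2)
print(strided, dilated)  # 32 64

# Each layer followed by one ordinary 3x3 convolution.
print(receptive_field([(3, 2, 1), (3, 1, 1)]))  # 7 (stride-2 downsampling)
print(receptive_field([(3, 1, 2), (3, 1, 1)]))  # 7 (stride-1, dilation 2)
```

Under these assumed settings, the dilated layer keeps the full 64-pixel resolution while the two-layer stack sees the same 7-pixel receptive field as the strided version, which is consistent with the sharper results reported above.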