Verisimilar Image Synthesis for Accurate
Detection and Recognition of Texts in Scenes
Abstract. The requirement for large amounts of annotated images has
become a grand challenge in training deep neural network models
for various visual detection and recognition tasks. This paper presents
a novel image synthesis technique that aims to generate large amounts
of annotated scene text images for training accurate and robust scene
text detection and recognition models. The proposed technique consists
of three innovative designs. First, it realizes “semantically coherent” synthesis by embedding texts at semantically sensible regions within the
background image, where semantic coherence is achieved by leveraging the semantic annotations of objects and image regions that have
been created in prior semantic segmentation research. Second, it exploits visual saliency to determine the embedding locations within each
semantically sensible region, which accords with the fact that texts are
often placed around homogeneous regions in scenes for better visibility.
Third, it designs an adaptive text appearance model that determines
the color and brightness of embedded texts by adaptively learning from
the features of real scene text images. The proposed technique has been
evaluated over five public datasets, and the experiments show its superior performance in training accurate and robust scene text detection
and recognition models.
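
As a rough illustration of the saliency-guided placement in the second design, the sketch below scores candidate embedding locations by local intensity variance, a simple homogeneity proxy rather than the authors' actual saliency model; the `semantic_mask` input and function names are hypothetical.

```python
import cv2
import numpy as np

def homogeneity_score(gray, win=31):
    """Score each pixel by negated local variance: flatter (more
    homogeneous) neighborhoods score higher. This is a crude stand-in
    for the visual-saliency cue described in the abstract."""
    g = gray.astype(np.float32)
    mean = cv2.blur(g, (win, win))
    mean_sq = cv2.blur(g * g, (win, win))
    var = np.maximum(mean_sq - mean * mean, 0.0)
    return -var  # higher score = flatter region

def pick_embedding_location(gray, semantic_mask, win=31):
    """Pick the flattest pixel inside one semantically sensible region.
    semantic_mask is a hypothetical boolean array marking that region,
    e.g. derived from semantic segmentation annotations."""
    score = homogeneity_score(gray, win)
    score[~semantic_mask] = -np.inf  # exclude pixels outside the region
    return np.unravel_index(np.argmax(score), score.shape)  # (row, col)
```

A full pipeline would replace the variance proxy with a proper saliency model and verify that the embedded text remains legible after blending.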