Abstract
We propose a novel hierarchical approach for text-to-image synthesis by inferring semantic layout. Instead of learning a direct mapping from text to image, our algorithm decomposes the generation process into multiple steps: it first constructs a semantic layout from the text with the layout generator, and then converts the layout to an image with the image generator. The proposed layout generator progressively constructs a semantic layout in a coarse-to-fine manner by generating object bounding boxes and refining each box by estimating the object shape inside it. The image generator synthesizes an image conditioned on the inferred semantic layout, which provides a useful semantic structure of the image that matches the text description. Our model not only generates semantically more meaningful images, but also enables automatic annotation of generated images and user control over the generation process by modifying the generated scene layout. We demonstrate the capabilities of the proposed model on the challenging MS-COCO dataset and show that it substantially improves image quality, interpretability of outputs, and semantic alignment to the input text over existing approaches.
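To make the hierarchical decomposition concrete, the following is a minimal, illustrative sketch of the text → boxes → shapes → image pipeline described above. All module names, layer choices, and dimensions here are hypothetical placeholders standing in for the paper's actual box generator, shape generator, and image generator; it only shows how the three stages compose, not the authors' implementation.

```python
# Illustrative sketch only: BoxGenerator, ShapeGenerator, and ImageGenerator are
# hypothetical stand-ins for the paper's components, not its released code.
import torch
import torch.nn as nn

class BoxGenerator(nn.Module):
    """Coarse step: predicts object bounding boxes from a text embedding."""
    def __init__(self, text_dim=128, max_objects=8):
        super().__init__()
        self.max_objects = max_objects
        # A single linear layer stands in for the paper's box generation model.
        self.fc = nn.Linear(text_dim, max_objects * 4)

    def forward(self, text_emb):
        boxes = self.fc(text_emb).view(-1, self.max_objects, 4)
        return torch.sigmoid(boxes)  # normalized (x, y, w, h) per object

class ShapeGenerator(nn.Module):
    """Fine step: refines each box into an object shape mask on the layout grid."""
    def __init__(self, grid=64, max_objects=8):
        super().__init__()
        self.grid = grid
        self.fc = nn.Linear(max_objects * 4, max_objects * grid * grid)

    def forward(self, boxes):
        n, m, _ = boxes.shape
        masks = self.fc(boxes.flatten(1)).view(n, m, self.grid, self.grid)
        return torch.sigmoid(masks)  # per-object masks forming the semantic layout

class ImageGenerator(nn.Module):
    """Synthesizes an image conditioned on the semantic layout and the text."""
    def __init__(self, grid=64, max_objects=8, text_dim=128):
        super().__init__()
        self.grid = grid
        self.text_to_map = nn.Linear(text_dim, grid * grid)
        self.net = nn.Sequential(
            nn.Conv2d(max_objects + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, layout, text_emb):
        text_map = self.text_to_map(text_emb).view(-1, 1, self.grid, self.grid)
        return self.net(torch.cat([layout, text_map], dim=1))

# End-to-end pipeline: text embedding -> boxes -> shape masks (layout) -> image.
text_emb = torch.randn(1, 128)          # stand-in for an encoded caption
boxes = BoxGenerator()(text_emb)        # coarse layout: bounding boxes
layout = ShapeGenerator()(boxes)        # fine layout: object shape masks
image = ImageGenerator()(layout, text_emb)
print(image.shape)                      # torch.Size([1, 3, 64, 64])
```

Because the intermediate layout is an explicit tensor, it can be inspected as an automatic annotation of the generated image or edited by a user before the image generator runs, which is the source of the controllability mentioned above.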