Abstract
The performance of image retrieval has been improved
tremendously in recent years through the use of deep feature representations. Most existing methods, however, aim
to retrieve images that are visually similar or semantically
relevant to the query, irrespective of spatial configuration.
In this paper, we develop a spatial-semantic image search
technology that enables users to search for images with
both semantic and spatial constraints by manipulating concept
text-boxes on a 2D query canvas. We train a convolutional
neural network to synthesize appropriate visual features that
capture the spatial-semantic constraints from the
user canvas query. We directly optimize the retrieval performance of the visual features when training our deep neural
network. These visual features are then used to retrieve images that are both spatially and semantically relevant to the
user query. Experiments on large-scale datasets such
as MS-COCO and Visual Genome show that our method
outperforms other baseline and state-of-the-art methods in
spatial-semantic image search.
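
As a rough illustration of the pipeline summarized above, the following Python sketch turns a canvas query into a feature vector and ranks images by cosine similarity. It is not the method of this paper: the learned convolutional network is replaced by a hand-written occupancy grid, and database images are described by hypothetical annotated boxes rather than deep visual features. All names (rasterize_canvas, cosine, retrieve, the vocabulary, and the image ids) are assumptions made for illustration.

```python
import math


def rasterize_canvas(boxes, vocab, grid=4):
    """Encode concept text-boxes as a flattened grid x grid x len(vocab) occupancy vector.

    boxes: list of (concept, x0, y0, x1, y1) with coordinates normalized to [0, 1].
    This is a stand-in for the learned feature synthesis network (assumption).
    """
    index = {concept: i for i, concept in enumerate(vocab)}
    feat = [0.0] * (grid * grid * len(vocab))
    for concept, x0, y0, x1, y1 in boxes:
        k = index[concept]
        for row in range(grid):
            for col in range(grid):
                # Mark a cell when its centre falls inside the box.
                cx = (col + 0.5) / grid
                cy = (row + 0.5) / grid
                if x0 <= cx <= x1 and y0 <= cy <= y1:
                    feat[(row * grid + col) * len(vocab) + k] = 1.0
    return feat


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_feat, database, top_k=2):
    """Return the ids of the top_k database entries most similar to the query."""
    ranked = sorted(database.items(),
                    key=lambda item: cosine(query_feat, item[1]),
                    reverse=True)
    return [image_id for image_id, _ in ranked[:top_k]]


# Hypothetical vocabulary and canvas query: a person on the left, a dog at the lower right.
vocab = ["person", "dog", "car"]
query = rasterize_canvas(
    [("person", 0.0, 0.0, 0.5, 1.0), ("dog", 0.5, 0.5, 1.0, 1.0)], vocab)

# Hypothetical database: img_a matches the layout, img_b mirrors it, img_c is unrelated.
database = {
    "img_a": rasterize_canvas(
        [("person", 0.0, 0.0, 0.5, 1.0), ("dog", 0.5, 0.5, 1.0, 1.0)], vocab),
    "img_b": rasterize_canvas(
        [("dog", 0.0, 0.5, 0.5, 1.0), ("person", 0.5, 0.0, 1.0, 1.0)], vocab),
    "img_c": rasterize_canvas([("car", 0.0, 0.0, 1.0, 1.0)], vocab),
}

best = retrieve(query, database, top_k=1)
```

In this toy setting img_b contains the same concepts as the query but in a mirrored layout, so it scores lower than img_a. That is the behaviour a spatial-semantic search is meant to exhibit, and a purely semantic retrieval would not distinguish the two.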