Abstract
In this paper, we address the task of natural language object retrieval: localizing a target object within a given image based on a natural language query describing the object. Natural language object retrieval differs from text-based image retrieval in that it involves spatial information about objects within the scene as well as global scene context. To address these challenges, we propose a novel Spatial Context Recurrent ConvNet (SCRC) model as a scoring function on candidate boxes for object retrieval, integrating spatial configurations and global scene-level contextual information into the network. Our model processes query text, local image descriptors, spatial configurations, and global context features through a recurrent network, outputs the probability of the query text conditioned on each candidate box as the box's score, and can transfer visual-linguistic knowledge from the image captioning domain to our task. Experimental results demonstrate that our method effectively utilizes both local and global information, significantly outperforming previous baseline methods on different datasets and scenarios, and can exploit large-scale vision and language datasets for knowledge transfer.
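The scoring idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the feature dimensions, the 8-dimensional spatial layout, the random weights, and all names below are assumptions chosen for illustration. It only shows the data flow, in which a recurrent model assigns each candidate box the log-probability of the query conditioned on that box's local, spatial, and global features, and the highest-scoring box is returned.

```python
import numpy as np

def spatial_config(box, img_w, img_h):
    """Assumed 8-dim spatial feature for a box (x1, y1, x2, y2):
    normalized corners, center, and size."""
    x1, y1, x2, y2 = box
    return np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                     (x1 + x2) / (2 * img_w), (y1 + y2) / (2 * img_h),
                     (x2 - x1) / img_w, (y2 - y1) / img_h])

class ToyRecurrentScorer:
    """Toy stand-in for an SCRC-style scoring function: a vanilla RNN
    language model whose initial state is conditioned on box features.
    Weights are random, so the scores themselves are meaningless."""

    def __init__(self, vocab, feat_dim, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.vocab = {w: i for i, w in enumerate(vocab)}
        v = len(vocab)
        self.emb = rng.normal(0, 0.1, (v, hidden))        # word embeddings
        self.W_h = rng.normal(0, 0.1, (hidden, hidden))   # recurrent weights
        self.W_x = rng.normal(0, 0.1, (hidden, hidden))   # input weights
        self.W_f = rng.normal(0, 0.1, (hidden, feat_dim)) # box-feature conditioning
        self.W_o = rng.normal(0, 0.1, (v, hidden))        # output projection

    def log_prob(self, query_tokens, feat):
        """log P(query | box features): sum of per-token log-probs."""
        h = np.tanh(self.W_f @ feat)  # initialize hidden state from box features
        lp = 0.0
        for w in query_tokens:
            logits = self.W_o @ h
            p = np.exp(logits - logits.max())
            p /= p.sum()
            lp += np.log(p[self.vocab[w]])
            h = np.tanh(self.W_h @ h + self.W_x @ self.emb[self.vocab[w]])
        return lp

# Score two candidate boxes for the query "red car"; the box with the
# highest query log-probability is the retrieval result.
vocab = ["red", "car", "dog", "sky"]
img_w, img_h = 640, 480
global_feat = np.full(4, 0.5)                        # placeholder global scene descriptor
boxes = [(10, 20, 200, 220), (300, 50, 630, 470)]
local_feats = [np.full(4, 0.2), np.full(4, 0.8)]     # placeholder local descriptors

scorer = ToyRecurrentScorer(vocab, feat_dim=4 + 4 + 8)
scores = []
for box, lf in zip(boxes, local_feats):
    feat = np.concatenate([lf, global_feat, spatial_config(box, img_w, img_h)])
    scores.append(scorer.log_prob(["red", "car"], feat))
best = boxes[int(np.argmax(scores))]
```

In the actual model the box features would come from a ConvNet and the scorer would be trained; the point here is only that retrieval reduces to ranking candidate boxes by the conditional probability of the query.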