Abstract
Several works have proposed to learn a two-path neural
network that maps images and texts, respectively, to the same
shared Euclidean space, whose geometry captures useful semantic relationships. Such a multi-modal embedding can be
trained and used for various tasks, notably image captioning. In the present work, we introduce a new architecture
of this type, with a visual path that leverages recent space-aware pooling mechanisms. Combined with a textual path
that is jointly trained from scratch, our semantic-visual
embedding offers a versatile model. Once trained under the
supervision of captioned images, it yields new state-of-the-art performance on cross-modal retrieval. It also allows the
localization of new concepts from the embedding space into
any input image, delivering state-of-the-art results on the visual grounding of phrases.
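
For readers unfamiliar with such two-path semantic-visual embeddings, the sketch below illustrates the general idea only: a visual path that projects convolutional feature maps into a shared space, a textual path encoding a caption into the same space, and a hinge-based ranking loss over in-batch negatives. The module names, dimensions, plain average pooling, and GRU text encoder are illustrative assumptions, not the specific architecture introduced in this work.

# Minimal sketch of a generic two-path semantic-visual embedding.
# All names, dimensions, and the ranking loss are illustrative assumptions,
# not the architecture or training objective of this paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualPath(nn.Module):
    """Maps CNN feature maps to the shared embedding space."""
    def __init__(self, feat_dim=2048, embed_dim=512):
        super().__init__()
        self.proj = nn.Conv2d(feat_dim, embed_dim, kernel_size=1)

    def forward(self, feat_maps):              # (B, C, H, W) CNN activations
        local = self.proj(feat_maps)            # per-location embeddings
        pooled = local.mean(dim=(2, 3))         # plain average pooling here;
                                                # stands in for a space-aware pooling mechanism
        return F.normalize(pooled, dim=-1)

class TextualPath(nn.Module):
    """Encodes a caption (token ids) into the same embedding space."""
    def __init__(self, vocab_size=20000, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.rnn = nn.GRU(300, embed_dim, batch_first=True)

    def forward(self, tokens):                  # (B, T) token ids
        _, h = self.rnn(self.embed(tokens))
        return F.normalize(h[-1], dim=-1)

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based ranking loss over in-batch negatives (a common choice)."""
    scores = img_emb @ txt_emb.t()              # cosine similarities (embeddings are L2-normalized)
    pos = scores.diag().unsqueeze(1)            # matched image-caption pairs
    cost_txt = (margin + scores - pos).clamp(min=0)      # caption negatives
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # image negatives
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_txt.masked_fill(mask, 0).mean() + cost_img.masked_fill(mask, 0).mean()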