Discriminative Bimodal Networks for
Visual Localization and Detection with Natural Language Queries
Abstract
Associating image regions with text queries has been
recently explored as a new way to bridge visual and linguistic representations. A few pioneering approaches have
been proposed based on recurrent neural language models
trained generatively (e.g., generating captions), but they achieve somewhat limited localization accuracy. To better address natural-language-based visual entity localization, we
propose a discriminative approach. We formulate a discriminative bimodal neural network (DBNet), which can be
trained as a classifier with extensive use of negative samples. Our training objective encourages better localization on single images, incorporates a broad
range of text phrases, and properly pairs image regions with text phrases
into positive and negative examples. Experiments on the
Visual Genome dataset demonstrate the proposed DBNet
significantly outperforms previous state-of-the-art methods
both for localization on single images and for detection on
multiple images. We also establish an evaluation protocol for natural-language visual detection. Code is available at: http://ytzhang.net/projects/dbnet