资源论文Beyond instance-level image retrieval: Leveraging captions to learn a global visual representation for semantic retrieval

Beyond instance-level image retrieval: Leveraging captions to learn a global visual representation for semantic retrieval

2019-12-10 | |  113 |   99 |   0

Abstract

Querying with an example image is a simple and intuitive interface to retrieve information from a visual database. Most of the research in image retrieval has focused on the task of instance-level image retrieval, where the goal is to retrieve images that contain the same object instance as the query image. In this work we move beyond instance-level retrieval and consider the task of semantic image retrieval in complex scenes, where the goal is to retrieve images that share the same semantics as the query image. We show that, despite its subjective nature, the task of semantically ranking visual scenes is consistently implemented across a pool of human annotators. We also show that a similarity based on human-annotated region-level captions is highly correlated with the human ranking and constitutes a good computable surrogate. Following this observation, we learn a visual embedding of the images where the similarity in the visual space is correlated with their semantic similarity surrogate. We further extend our model to learn a joint embedding of visual and textual cues that allows one to query the database using a text modififier in addition to the query image, adapting the results to the modififier. Finally, our model can ground the ranking decisions by showing regions that contributed the most to the similarity between pairs of images, providing a visual explanation of the similarity

上一篇:Annotating Object Instances with a Polygon-RNN

下一篇:Boundary-aware Instance Segmentation

用户评价
全部评价

热门资源

  • Deep Cross-media ...

    Cross-media retrieval is a research hotspot in ...

  • Regularizing RNNs...

    Recently, caption generation with an encoder-de...

  • The Variational S...

    Unlike traditional images which do not offer in...

  • Visual Reinforcem...

    For an autonomous agent to fulfill a wide range...

  • Joint Pose and Ex...

    Facial expression recognition (FER) is a challe...