Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations

2020-02-25

Abstract
In vision-and-language grounding problems, fine-grained representations of the image are considered to be of paramount importance. Most current systems incorporate visual features and textual concepts as a sketch of an image. However, such plainly inferred representations are usually undesirable, as they are composed of separate components whose relations are elusive. In this work, we aim to represent an image with a set of integrated visual regions and corresponding textual concepts, reflecting certain semantics. To this end, we build the Mutual Iterative Attention (MIA) module, which integrates correlated visual features and textual concepts, respectively, by aligning the two modalities. We evaluate the proposed approach on two representative vision-and-language grounding tasks, i.e., image captioning and visual question answering. In both tasks, the semantic-grounded image representations consistently boost the performance of the baseline models across all metrics. The results demonstrate that our approach is effective and generalizes well to a wide range of models for image-related applications.
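The abstract describes MIA as integrating correlated visual features and textual concepts by aligning the two modalities. Below is a minimal, hypothetical sketch of what such a mutual iterative attention module could look like, assuming it resembles iterated bidirectional cross-attention with residual refinement. The class name, feature dimensions, and iteration count are illustrative assumptions for demonstration, not the authors' released implementation.

```python
# A minimal sketch of mutual iterative attention between visual-region
# features and textual-concept embeddings. All hyperparameters below
# (dim, num_heads, num_iters) are assumptions, not the paper's values.
import torch
import torch.nn as nn


class MutualIterativeAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, num_iters: int = 2):
        super().__init__()
        self.num_iters = num_iters
        # Cross-attention in both directions: regions attend to concepts,
        # and concepts attend to regions.
        self.region_to_concept = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.concept_to_region = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, regions: torch.Tensor, concepts: torch.Tensor):
        # regions:  (batch, num_regions, dim)  visual-region features
        # concepts: (batch, num_concepts, dim) textual-concept embeddings
        v, t = regions, concepts
        for _ in range(self.num_iters):
            # Refine visual features with information from aligned concepts.
            v_new, _ = self.region_to_concept(query=v, key=t, value=t)
            # Refine concept embeddings with information from aligned regions.
            t_new, _ = self.concept_to_region(query=t, key=v, value=v)
            # Residual connections preserve each modality's original content.
            v = self.norm_v(v + v_new)
            t = self.norm_t(t + t_new)
        # Return semantic-grounded region and concept representations.
        return v, t


if __name__ == "__main__":
    mia = MutualIterativeAttention()
    regions = torch.randn(2, 36, 512)   # e.g., 36 detected regions per image
    concepts = torch.randn(2, 10, 512)  # e.g., 10 predicted textual concepts
    v, t = mia(regions, concepts)
    print(v.shape, t.shape)  # torch.Size([2, 36, 512]) torch.Size([2, 10, 512])
```

In this reading, each round of bidirectional cross-attention lets the two modalities iteratively ground one another, so the final region features carry concept-level semantics and vice versa; the refined representations can then be fed to a downstream captioning or VQA model in place of the raw features.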
