Abstract
This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learned directly from a dataset of image captions. We use multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model. The language model learns from a set of over 400,000 image descriptions to capture the statistics of word usage. We capture global semantics by re-ranking caption candidates using sentence-level features and a deep multimodal similarity model. Our system is state-of-the-art on the official Microsoft COCO benchmark, producing a BLEU-4 score of 29.1%. When human judges compare the system captions to ones written by other people on our held-out test set, the system captions have equal or better quality 34% of the time.
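To make the multiple instance learning step concrete, the sketch below illustrates one common noisy-OR formulation on toy data: each image is treated as a bag of candidate-region features, the bag label records whether a word appears in any of the image's captions, and a logistic per-region scorer is trained so that the noisy-OR of the region probabilities matches the image-level label. The data, dimensions, and scorer here are hypothetical placeholders for illustration, not the paper's implementation.

# Illustrative sketch (not the paper's code): noisy-OR multiple instance
# learning for a single word detector. Each image is a "bag" of region
# feature vectors; the bag label says whether the word occurs in any of
# the image's captions. All data, sizes, and names here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_probability(regions, w, b):
    # P(word | image) = 1 - prod_j (1 - p_j), with p_j the probability
    # that region j depicts the word under a logistic scorer.
    p_region = sigmoid(regions @ w + b)            # shape: (num_regions,)
    return 1.0 - np.prod(1.0 - p_region), p_region

# Toy data: 200 images, 12 candidate regions each, 64-d region features.
# The word is "present" when any region's first feature exceeds 1.0.
images = rng.normal(size=(200, 12, 64))
labels = (images[:, :, 0].max(axis=1) > 1.0).astype(float)

w, b, lr = np.zeros(64), 0.0, 0.05
for _ in range(50):                                # a few epochs of SGD
    for x, y in zip(images, labels):
        p_bag, p_region = bag_probability(x, w, b)
        # Cross-entropy loss on the bag label; its gradient w.r.t. each
        # region logit works out to p_region * (p_bag - y) / p_bag.
        grad_z = p_region * (p_bag - y) / max(p_bag, 1e-8)
        w -= lr * (x.T @ grad_z)
        b -= lr * grad_z.sum()

preds = np.array([bag_probability(x, w, b)[0] for x in images]) > 0.5
print("training accuracy:", (preds == labels.astype(bool)).mean())

In the full system described above, one such detector would be trained for each commonly occurring caption word, and the detected words then serve as conditional inputs to the maximum-entropy language model.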