Video Interactive Captioning with Human Prompts

2019-10-08


Abstract: Video captioning aims to generate a sentence that properly describes the content of a video. Because a video often contains rich visual content and semantic detail, different people may be interested in different aspects of it, so a single generated sentence often fails to meet their ad hoc expectations. In this paper, we make a new attempt: we launch a round of interaction between a human and a captioning agent. After generating an initial caption, the agent asks the human for a short prompt as a clue to his or her expectation. Based on the prompt, the agent can then generate a more accurate caption. We name this new task video interactive captioning (ViCap). Taking a video and an initial caption as input, we devise a ViCap agent that consists of a video encoder, an initial caption encoder, and a refined caption generator. We show that ViCap can be trained either with full supervision (with ground truth) or with weak supervision (with only prompts). To evaluate ViCap, we first extend MSRVTT with interaction ground truth. Experimental results show not only that prompts help generate more accurate captions, but also that the proposed method performs well.
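The single round of interaction described in the abstract (initial caption, human prompt, refined caption) can be illustrated with a toy sketch. Everything below is hypothetical and is not the authors' implementation: `ToyViCapAgent`, `interact`, and the candidate-matching rule are assumptions made for illustration, and a simple string lookup stands in for the video encoder, the initial caption encoder, and the refined caption generator.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ToyViCapAgent:
    """Hypothetical stand-in for a ViCap agent (no neural networks)."""

    # Toy "knowledge": candidate captions for a single video.
    candidates: List[str]

    def initial_caption(self, video_id: str) -> str:
        # A real agent would decode from video features;
        # this toy simply returns the first candidate.
        return self.candidates[0]

    def refine(self, video_id: str, initial: str, prompt: str) -> str:
        # Assumed toy rule: return the first candidate that contains the
        # prompt and differs from the initial caption; otherwise keep
        # the initial caption unchanged.
        key = prompt.strip().lower()
        for caption in self.candidates:
            if key and key in caption.lower() and caption != initial:
                return caption
        return initial


def interact(
    agent: ToyViCapAgent,
    video_id: str,
    ask_human: Callable[[str], str],
) -> Tuple[str, str, str]:
    """Run one round: initial caption -> human prompt -> refined caption."""
    initial = agent.initial_caption(video_id)
    prompt = ask_human(initial)  # the human sees the initial caption
    refined = agent.refine(video_id, initial, prompt)
    return initial, prompt, refined


if __name__ == "__main__":
    agent = ToyViCapAgent(
        candidates=["a man is talking", "a man is cooking in a kitchen"]
    )
    # The human callback is simulated with a fixed prompt.
    print(interact(agent, "video0", lambda caption: "cooking"))
```

In this sketch the human is modeled as a callback that receives the initial caption and returns a short prompt, which keeps the interaction loop separate from whatever model produces the captions.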
