Video Question Answering via Hierarchical Spatio-Temporal Attention Networks


2019-11-04


Abstract: Open-ended video question answering is a challenging problem in visual information retrieval: the task is to automatically generate a natural language answer from the referenced video content according to the question. However, existing visual question answering work focuses only on static images and may be ineffective when applied to video question answering, because it does not model the temporal dynamics of video content. In this paper, we consider the problem of open-ended video question answering from the viewpoint of a spatio-temporal attentional encoder-decoder learning framework. We propose a hierarchical spatio-temporal attention network for learning the joint representation of dynamic video content according to the given question. We then develop a spatio-temporal attentional encoder-decoder learning method with a multi-step reasoning process for open-ended video question answering. We construct a large-scale video question answering dataset, and extensive experiments show the effectiveness of our method.
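To make the idea of question-guided hierarchical attention more concrete, the following is a minimal illustrative sketch, not the authors' model. It assumes each frame is given as a list of region feature vectors and the question as a single vector of the same dimension, and it uses plain dot-product scoring with a softmax. The scoring function and the names `attend` and `hierarchical_attention` are assumptions made for illustration, and the encoder-decoder and multi-step reasoning components mentioned in the abstract are not modeled here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(features, query):
    """Soft attention: weight each feature vector by softmax(feature . query).

    Dot-product scoring is an assumption of this sketch, not taken from the paper.
    Returns the weighted average vector and the attention weights.
    """
    weights = softmax([dot(f, query) for f in features])
    dim = len(features[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return pooled, weights


def hierarchical_attention(video, question):
    """video: list of frames; each frame is a list of region feature vectors."""
    # Spatial level: pool the regions of every frame, guided by the question.
    frame_vectors = [attend(regions, question)[0] for regions in video]
    # Temporal level: pool the resulting frame vectors, again guided by the question.
    return attend(frame_vectors, question)


# Toy input: 2 frames, 2 regions per frame, 2-dimensional features (made-up values).
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [0.0, 0.0]],
]
question = [0.0, 1.0]
joint_vector, frame_weights = hierarchical_attention(video, question)
```

In this toy input the second frame contains the region that aligns most strongly with the question vector, so it receives the larger temporal weight. A learned model would replace the raw dot product with trained scoring parameters.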
