Abstract
We investigate the problem of object referring (OR), i.e., localizing a target object in a visual scene given a language description. Humans perceive the world more as continuous video snippets than as static images, and describe objects not only by their appearance, but also by
their spatio-temporal context and motion features. Humans
also gaze at the object when they issue a referring expression. Existing works on OR mostly focus on static images, which fall short of providing many such cues. This
paper addresses OR in videos with language and human
gaze. To that end, we present a new video dataset for OR,
with 30,000 objects over 5,000 stereo video sequences annotated with descriptions and gaze. We further propose a novel network model for OR in videos that integrates appearance, motion, gaze, and spatio-temporal context into one network. Experimental results show that our
method effectively utilizes motion cues, human gaze, and
spatio-temporal context. Our method outperforms previous
OR methods. For the dataset and code, please refer to https://people.ee.ethz.ch/~arunv/ORGaze.html.