Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

2019-10-21
Abstract. The thud of a bouncing ball, the onset of speech as lips open — when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation. We propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned. We use this learned representation for three applications: (a) sound source localization, i.e. visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/offscreen audio source separation, e.g. removing the off-screen translator’s voice from a foreign official’s speech. Code, models, and video results are available on our webpage: http://andrewowens.com/multisensory
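
The self-supervised task described in the abstract (predicting whether video frames and audio are temporally aligned) can be illustrated with a minimal sketch of how labeled training examples might be generated. This is not the authors' implementation: the function name `make_alignment_example`, the parameters `samples_per_frame` and `min_shift_frames`, and the use of a circular shift to misalign the audio are all assumptions made for illustration.

```python
import random


def make_alignment_example(frames, audio, samples_per_frame, min_shift_frames, rng):
    """Build one training example for an aligned-vs-misaligned classifier.

    Hypothetical helper (not from the paper's code release).
    Returns (frames, audio, label): label 1 means the audio is untouched,
    label 0 means the audio was shifted in time relative to the frames.
    """
    n_frames = len(frames)
    if len(audio) != n_frames * samples_per_frame:
        raise ValueError("audio length must equal n_frames * samples_per_frame")
    if n_frames < 2 * min_shift_frames or min_shift_frames < 1:
        raise ValueError("clip too short for the requested minimum shift")

    # Positive example: keep the original, synchronized audio.
    if rng.random() < 0.5:
        return frames, audio, 1

    # Negative example: circularly shift the audio (an assumed simplification).
    # Choosing the shift in [min, n - min] keeps the offset at least
    # min_shift_frames away from zero in either direction.
    shift_frames = rng.randint(min_shift_frames, n_frames - min_shift_frames)
    k = shift_frames * samples_per_frame
    shifted = audio[k:] + audio[:k]
    return frames, shifted, 0


# Toy usage: 8 "frames" and 4 audio samples per frame, represented as integers.
rng = random.Random(0)
frames = list(range(8))
audio = list(range(8 * 4))
examples = [make_alignment_example(frames, audio, 4, 2, rng) for _ in range(50)]
```

A network trained on such pairs only needs the video itself as supervision, since the label comes from whether the audio was shifted. How the misaligned audio is actually sampled, and by how much, is a design choice that this sketch does not take from the paper.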
