Abstract
Visual events are usually accompanied by sounds in our daily lives. We pose the question: can a machine learn the correspondence between a visual scene and its sound, and localize the sound source only by observing sound and visual scene pairs, as humans do? In this paper, we propose a novel unsupervised algorithm to address the problem of localizing sound sources in visual scenes. A two-stream network structure, which handles each modality separately, is developed with an attention mechanism for sound source localization. Moreover, although our network is formulated within
the unsupervised learning framework, it can be extended
to a unified architecture with a simple modification for the
supervised and semi-supervised learning settings as well.
In addition, a new sound source dataset is developed for performance evaluation. Our empirical evaluation shows that the unsupervised method can reach false conclusions in some cases. We also show that even a small amount of supervision, i.e., a semi-supervised setup, effectively corrects these false conclusions.
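To make the two-stream design concrete, the following is a minimal sketch of such an architecture, assuming PyTorch; the placeholder backbone, layer sizes, and cosine-similarity attention are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a two-stream sound-localization network with attention.
# Assumes PyTorch; module names and dimensions are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamLocalizer(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Visual stream: stands in for any CNN backbone that yields a
        # spatial feature map over the input frame.
        self.visual = nn.Conv2d(3, dim, kernel_size=7, stride=16)
        # Sound stream: embeds the audio input (e.g., a spectrogram)
        # into a single vector of the same dimensionality.
        self.audio = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))

    def forward(self, frame, sound):
        v = self.visual(frame)             # (B, dim, H, W) visual features
        a = self.audio(sound)              # (B, dim) audio embedding
        B, C, H, W = v.shape
        v_flat = v.view(B, C, H * W)       # (B, dim, H*W)
        # Attention: cosine similarity between the audio embedding and
        # each visual location, normalized over space, gives a
        # localization map highlighting the likely sound source.
        scores = torch.einsum('bc,bcn->bn',
                              F.normalize(a, dim=1),
                              F.normalize(v_flat, dim=1))
        attn = F.softmax(scores, dim=1)    # (B, H*W)
        # Attention-weighted visual vector, usable in an unsupervised
        # correspondence loss or a (semi-)supervised objective.
        z = torch.einsum('bn,bcn->bc', attn, v_flat)
        return attn.view(B, H, W), z
```

Under this reading, the unsupervised, semi-supervised, and supervised settings would differ only in the loss applied to the attention map and attended feature, which is consistent with the unified architecture the abstract describes.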