Abstract. Perceiving a scene most fully requires all the senses. Yet
modeling how objects look and sound is challenging: most natural scenes
and events contain multiple objects, and the audio track mixes all the
sound sources together. We propose to learn audio-visual object models
from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep
multi-instance multi-label learning framework to disentangle the audio
frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered
disentangled bases can be used to guide audio source separation to obtain
better-separated, object-level sounds. Our work is the first to learn audio
source separation from large-scale “in the wild” videos containing multiple audio sources per video. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising. Our video results:
http://vision.cs.utexas.edu/projects/separating_object_sounds/
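To make the guided-separation step the abstract alludes to concrete, here is a minimal sketch, assuming the per-object spectral bases have already been learned (e.g., hypothetical arrays `W_violin` and `W_dog`, one column per frequency basis). The `fit_activations` helper, the placeholder audio, and all variable names are illustrative stand-ins, not the authors' released code. With the bases held fixed, nonnegative activations for a novel mixture are estimated by standard NMF multiplicative updates, and each object's track is recovered by soft-masking the mixture spectrogram.

```python
import numpy as np
from scipy.signal import stft, istft

def fit_activations(V, W, n_iter=200, eps=1e-10):
    """Solve V ~= W @ H for H >= 0 with the bases W held fixed,
    using multiplicative updates for the KL-divergence objective."""
    H = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ np.ones_like(V) + eps)
    return H

# --- Placeholder inputs (illustrative only) ---------------------------------
rng = np.random.default_rng(0)
sr = 16000
mixture = rng.standard_normal(sr * 4)   # stand-in for a novel video's audio
W_violin = rng.random((513, 25))        # stand-in for learned per-object bases
W_dog = rng.random((513, 25))           # (freq_bins x n_bases each)

# Magnitude spectrogram of the mixed audio track
f, t, Z = stft(mixture, fs=sr, nperseg=1024)
V = np.abs(Z)

# Stack the bases of the objects detected in the video frames, then
# estimate their activations on the mixture with the bases fixed
W = np.hstack([W_violin, W_dog])
H = fit_activations(V, W)

# Reconstruct one object's magnitude and soft-mask the complex mixture
V_violin = W_violin @ H[: W_violin.shape[1]]
mask = V_violin / (W @ H + 1e-10)
_, violin_audio = istft(mask * Z, fs=sr, nperseg=1024)
```

The soft mask is a Wiener-style ratio of the object's reconstructed magnitude to the full reconstruction, so the separated tracks sum back (approximately) to the mixture; in practice the bases would come from the MIML disentanglement stage rather than random placeholders.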