Abstract
We propose a probabilistic model of single-source multimodal generation and show how algorithms for maximizing mutual information can find the correspondences between components of each signal. We show how non-parametric techniques for finding informative subspaces can capture the complex statistical relationship between signals in different modalities. We extend a previous technique for finding informative subspaces to include new priors on the projection weights, yielding more robust results. Applied to human speakers, our model can find the relationship between audio speech and video of facial motion, and partially segment out background events in both channels. We present new results on the problem of audio-visual verification, and show how the audio and video of a speaker can be matched even when no prior model of the speaker’s voice or appearance is available.