Abstract
We present a new approach to modeling and processing multimedia data. This approach is based on graphical models that combine audio and video variables. We demonstrate it by developing a new algorithm for tracking a moving object in a cluttered, noisy scene using two microphones and a camera. Our model uses unobserved variables to describe the data in terms of the process that generates them. It is therefore able to capture and exploit the statistical structure of the audio and video data separately, as well as their mutual dependencies. Model parameters are learned from data via an EM algorithm, and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location from data. We demonstrate successful performance on multimedia clips captured in real-world scenarios using off-the-shelf equipment.