Abstract
We consider scenarios in which we wish to perform joint
scene understanding, object tracking, activity recognition,
and other tasks in environments where multiple people
are wearing body-worn cameras while a third-person static
camera also captures the scene. To do this, we need to establish person-level correspondences across first- and third-person videos, which is challenging because the camera
wearer is not visible from his/her own egocentric video,
preventing the use of direct feature matching. In this paper, we propose a new semi-Siamese Convolutional Neural Network architecture to address this novel challenge.
We formulate the problem as learning a joint embedding
space for first- and third-person videos that considers both
spatial- and motion-domain cues. A new triplet loss function is designed to minimize the distance between correct
first- and third-person matches while maximizing the distance between incorrect ones. This end-to-end approach
performs significantly better than several baselines, in part by learning first- and third-person features that are optimized for matching jointly with the distance measure itself.