Joint Person Segmentation and Identification in
Synchronized First- and Third-person Videos
Abstract. In a world of pervasive cameras, public spaces are often captured from multiple perspectives by cameras of different types, both fixed
and mobile. An important problem is to organize these heterogeneous collections of videos by finding connections between them, such as identifying correspondences between the people appearing in the videos and the
people holding or wearing the cameras. In this paper, we wish to solve two
specific problems: (1) given two or more synchronized third-person videos
of a scene, produce a pixel-level segmentation of each visible person and
identify corresponding people across different views (i.e., determine who
in camera A corresponds with whom in camera B), and (2) given one
or more synchronized third-person videos as well as a first-person video
taken by a mobile or wearable camera, segment and identify the camera
wearer in the third-person videos. Unlike previous work which requires
ground truth bounding boxes to estimate the correspondences, we perform person segmentation and identification jointly. We find that solving
these two problems simultaneously is mutually beneficial, because better
fine-grained segmentation allows us to better perform matching across
views, and information from multiple views helps us perform more accurate segmentation. We evaluate our approach on two challenging datasets
of interacting people captured from multiple wearable cameras, and show
that our proposed method performs significantly better than the state-of-the-art on both person segmentation and identification.