Abstract
This paper presents a method to assign a semantic label to a 3D reconstructed trajectory from multiview image
streams. The key challenge of the semantic labeling lies
in the self-occlusion and photometric inconsistency caused
by object and social interactions, resulting in highly fragmented trajectory reconstruction with noisy semantic labels. We address this challenge by introducing a new representation called 3D semantic map—a probability distribution over labels per 3D trajectory constructed by a set of
semantic recognition across multiple views. Our conjecture
is that among many views, there exist a set of views that are
more informative than the others. We build the 3D semantic
map based on a likelihood of visibility and 2D recognition
confidence and identify the view that best represents the semantics of the trajectory. We use this 3D semantic map and
trajectory affinity computed by local rigid transformation
to precisely infer labels as a whole. This global inference
quantitatively outperforms the baseline approaches in terms
of predictive validity, representation robustness, and affinity effectiveness. We demonstrate that our algorithm can
robustly compute the semantic labels of a large scale trajectory set (e.g., millions of trajectories) involving real-world
human interactions with object, scenes, and people