Abstract
We present a new task of predicting the future locations of
people observed in first-person videos. Consider a first-person video stream continuously recorded by a wearable
camera. Given a short clip of a person that is extracted
from the complete stream, we aim to predict that person’s
location in future frames. To facilitate this future person
localization, we make the following three key observations: a) First-person videos typically involve significant ego-motion, which greatly affects the location of the target person in future frames; b) The scale of the target person acts as a salient cue for estimating perspective effects
in first-person videos; c) First-person videos often capture
people up-close, making it easier to leverage target poses
(e.g., where they look) for predicting their future locations.
We incorporate these three observations into a prediction
framework with a multi-stream convolution-deconvolution
architecture. Experimental results reveal our method to be
effective on our new dataset as well as on a public social
interaction dataset.
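As a rough illustration only, the sketch below shows what a multi-stream convolution-deconvolution predictor of this kind could look like in PyTorch. The stream contents (past location/scale, target pose, camera ego-motion), channel widths, class names (StreamEncoder, MultiStreamPredictor), and the simplifying assumption that one future offset is predicted per observed frame are illustrative choices, not the architecture specified in the paper.

```python
# Hypothetical sketch of a multi-stream convolution-deconvolution predictor
# for future person localization. Input streams, channel widths, and sequence
# lengths are illustrative assumptions, not the paper's exact specification.
import torch
import torch.nn as nn


class StreamEncoder(nn.Module):
    """1-D temporal convolutions over one input stream of the past T_in frames."""

    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x):          # x: (batch, in_dim, T_in)
        return self.net(x)         # (batch, hidden, T_in)


class MultiStreamPredictor(nn.Module):
    """Fuse location/scale, pose, and ego-motion streams, then deconvolve
    the fused features into future (x, y) locations (one per observed frame)."""

    def __init__(self, hidden=64):
        super().__init__()
        self.loc_enc = StreamEncoder(in_dim=3, hidden=hidden)    # x, y, scale
        self.pose_enc = StreamEncoder(in_dim=36, hidden=hidden)  # e.g., 18 joints * (x, y)
        self.ego_enc = StreamEncoder(in_dim=2, hidden=hidden)    # camera motion proxy
        self.decoder = nn.Sequential(
            nn.Conv1d(3 * hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 2, kernel_size=1),                 # future (x, y) per frame
        )

    def forward(self, loc, pose, ego):
        fused = torch.cat([self.loc_enc(loc),
                           self.pose_enc(pose),
                           self.ego_enc(ego)], dim=1)            # (batch, 3*hidden, T_in)
        return self.decoder(fused)                               # (batch, 2, T_in)


if __name__ == "__main__":
    batch, t_in = 4, 10
    model = MultiStreamPredictor()
    pred = model(torch.randn(batch, 3, t_in),     # past locations and scales
                 torch.randn(batch, 36, t_in),    # past poses
                 torch.randn(batch, 2, t_in))     # past ego-motion
    print(pred.shape)  # torch.Size([4, 2, 10])
```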