Abstract
A major stumbling block to progress in understanding
basic human interactions, such as getting out of bed or
opening a refrigerator, is the lack of good training data. Most
past efforts have gathered this data explicitly: starting with
a laundry list of action labels, and then querying search
engines for videos tagged with each label. In this work,
we do the reverse and search implicitly: we start with a
large collection of interaction-rich video data and then annotate and analyze it. We use Internet Lifestyle Vlogs as the
source of surprisingly large and diverse interaction data.
We show that by collecting the data first, we are able to
achieve greater scale and far greater diversity in terms of
actions and actors. Additionally, our data exposes biases
built into common explicitly gathered data. We make sense
of our data by analyzing the central component of interaction – hands. We benchmark two tasks: identifying semantic
object contact at the video level and non-semantic contact
state at the frame level. We additionally demonstrate future
prediction of hands.