Jointly Learning Energy Expenditures and Activities using
Egocentric Multimodal Signals
Abstract
Physiological signals such as heart rate can provide valuable information about an individual’s state and activity. However, existing work in computer vision has not yet explored leveraging these signals to enhance egocentric video understanding. In this work, we propose a model
for reasoning over multimodal data to jointly predict activities and energy expenditures. We use heart rate signals as
privileged self-supervision to derive energy expenditure during training. A multitask objective jointly optimizes the two tasks. Additionally, we introduce a dataset
that contains 31 hours of egocentric video augmented with
heart rate and acceleration signals. This work can enable new applications such as a visual calorie counter.
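As a minimal sketch of such a multitask objective (an illustration under assumed loss choices, not a formulation taken from this abstract), the two tasks can be combined with a weighting hyperparameter $\lambda$:

$$\mathcal{L} = \mathcal{L}_{\text{act}} + \lambda \, \mathcal{L}_{\text{EE}},$$

where $\mathcal{L}_{\text{act}}$ is a cross-entropy loss over activity labels and $\mathcal{L}_{\text{EE}}$ is a regression loss (e.g., mean squared error) against heart-rate-derived energy expenditure targets; the specific losses and $\lambda$ are assumptions for illustration.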