Abstract
We present a unified framework for understanding 3D
hand and object interactions in raw image sequences from
egocentric RGB cameras. Given a single RGB image, our
model jointly estimates the 3D hand and object poses, models their interactions, and recognizes the object and action
classes with a single feed-forward pass through a neural
network. We propose a single architecture that does not
rely on external detection algorithms but rather is trained
end-to-end on single images. We further merge and propagate information in the temporal domain to infer interactions between hand and object trajectories and recognize
actions. The complete model takes as input a sequence of
frames and outputs per-frame 3D hand and object pose predictions along with estimates of the object and action categories for the entire sequence. We demonstrate state-of-the-art performance of our algorithm even in comparison to approaches that rely on depth data and ground-truth annotations.
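
To make the input/output structure described above concrete, the following is a minimal sketch of such a model interface, not the authors' actual architecture: all module names, layer choices, and dimensions (e.g., HandObjectNet, the number of hand joints and object corners, the LSTM-based temporal module) are hypothetical placeholders. It only illustrates a single feed-forward pass producing per-frame 3D hand and object poses and object class scores, with temporal propagation yielding one action label per sequence.

```python
# Hypothetical sketch, not the paper's implementation.
import torch
import torch.nn as nn

class HandObjectNet(nn.Module):
    """Per-frame 3D hand/object pose and class logits, plus sequence-level action logits."""
    def __init__(self, num_hand_joints=21, num_obj_corners=8,
                 num_obj_classes=10, num_action_classes=45, feat_dim=256):
        super().__init__()
        # Shared convolutional backbone (placeholder layers).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Per-frame heads: 3D hand pose, 3D object pose, object class.
        self.hand_pose = nn.Linear(feat_dim, num_hand_joints * 3)
        self.obj_pose = nn.Linear(feat_dim, num_obj_corners * 3)
        self.obj_cls = nn.Linear(feat_dim, num_obj_classes)
        # Temporal module propagating per-frame features for action recognition.
        self.temporal = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.action_cls = nn.Linear(feat_dim, num_action_classes)

    def forward(self, frames):                              # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))         # (B*T, feat_dim)
        hand = self.hand_pose(feats).view(B, T, -1, 3)      # per-frame 3D hand joints
        obj = self.obj_pose(feats).view(B, T, -1, 3)        # per-frame 3D object keypoints
        obj_logits = self.obj_cls(feats).view(B, T, -1)     # per-frame object class scores
        seq_feats, _ = self.temporal(feats.view(B, T, -1))  # merge information over time
        action_logits = self.action_cls(seq_feats[:, -1])   # one action label per sequence
        return hand, obj, obj_logits, action_logits

# Example usage with a dummy 8-frame clip.
model = HandObjectNet()
clip = torch.randn(2, 8, 3, 128, 128)
hand, obj, obj_logits, action_logits = model(clip)
```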