Abstract
Action recognition and human pose estimation are
closely related, but the two problems are generally handled
as distinct tasks in the literature. In this work, we propose a multitask framework for joint 2D and 3D pose
estimation from still images and human action recognition from video sequences. We show that a single architecture can solve both problems
efficiently and still achieve state-of-the-art results. Additionally, we demonstrate that end-to-end optimization leads to significantly higher accuracy than separate
learning. The proposed architecture can be trained with
data from different categories simultaneously in a seamless way. The reported results on four datasets (MPII,
Human3.6M, Penn Action and NTU) demonstrate the effectiveness of our method on the targeted tasks.