FSA-Net: Learning Fine-Grained Structure Aggregation for Head Pose Estimation from a Single Image
Abstract
This paper proposes a method for head pose estimation
from a single image. Previous methods often predict head
poses through landmark or depth estimation and thus require more computation than necessary. Our method is
based on regression and feature aggregation. To keep the
model compact, we employ the soft stagewise regression
scheme. Existing feature aggregation methods treat inputs
as a bag of features and thus ignore their spatial relationship in a feature map. We propose to learn a fine-grained
structure mapping for spatially grouping features before aggregation. The fine-grained structure provides part-based
information and pooled values. By using learnable and
non-learnable importance over spatial locations, we generate different model variants that form a complementary ensemble. Experiments show that our method outperforms state-of-the-art methods, including both the
landmark-free ones and the ones based on landmark or
depth estimation. With only a single RGB frame as input, our method even outperforms methods that use multi-modality information (RGB-D, RGB-Time) in estimating
the yaw angle. Furthermore, the memory overhead of our
model is 100× smaller than that of previous methods.
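
To make the contrast between bag-of-features aggregation and spatial grouping concrete, the following is a minimal sketch and not the implementation described in this paper. It assumes that channel variance serves as a non-learnable importance at each spatial location, and it replaces the learned fine-grained structure mapping with a hypothetical fixed mapping (`block_mapping`) that splits locations into contiguous blocks. All function names are illustrative.

```python
from statistics import pvariance

def global_average(feature_map):
    """Bag-of-features pooling: average each channel over all locations."""
    n = len(feature_map)
    return [sum(channel) / n for channel in zip(*feature_map)]

def variance_importance(feature_map):
    """Assumed non-learnable importance: channel variance per location."""
    return [pvariance(vec) for vec in feature_map]

def block_mapping(importance, n_groups):
    """Hypothetical stand-in for a learned mapping.

    Splits the locations into n_groups contiguous blocks and weights each
    location by its importance, normalized within its block. Assumes the
    number of locations is divisible by n_groups.
    """
    n = len(importance)
    size = n // n_groups
    mapping = []
    for g in range(n_groups):
        block = range(g * size, (g + 1) * size)
        total = sum(importance[l] for l in block) or 1.0
        row = [0.0] * n
        for l in block:
            row[l] = importance[l] / total
        mapping.append(row)
    return mapping

def group_features(feature_map, mapping):
    """Apply the mapping: each row yields one grouped feature vector."""
    n_channels = len(feature_map[0])
    return [
        [sum(w * vec[c] for w, vec in zip(row, feature_map))
         for c in range(n_channels)]
        for row in mapping
    ]

def grouped_representation(feature_map, n_groups):
    """Importance -> mapping -> grouped features, ready for aggregation."""
    importance = variance_importance(feature_map)
    return group_features(feature_map, block_mapping(importance, n_groups))

# Toy feature map: 4 spatial locations, 2 channels each.
fm = [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0], [5.0, 1.0]]
# Same feature vectors placed at different locations.
shuffled = [fm[3], fm[0], fm[2], fm[1]]

print(global_average(fm), global_average(shuffled))
print(grouped_representation(fm, 2), grouped_representation(shuffled, 2))
```

In this toy setting, global average pooling returns the same vector for both layouts, whereas the grouped representation changes when the features are rearranged, which is the spatial information a bag-of-features aggregator discards.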