High Performance Gesture Recognition via Effective and Efficient Temporal Modeling
Abstract
State-of-the-art hand gesture recognition methods have investigated spatiotemporal features based on 3D convolutional neural networks (3DCNNs) or convolutional long short-term memory (ConvLSTM). However, they often suffer from inefficiency due to the high computational complexity of their network structures. In this paper, we focus instead on 1D convolutional neural networks and propose a simple and efficient architectural unit, the Multi-Kernel Temporal Block (MKTB), which models multi-scale temporal responses by explicitly applying temporal kernels of different sizes. We then present a Global Refinement Block (GRB), an attention module that shapes global temporal features based on cross-channel similarity. By incorporating the MKTB and GRB, our architecture effectively explores spatiotemporal features at tolerable computational cost. Extensive experiments on public datasets demonstrate that our model achieves state-of-the-art accuracy with higher efficiency. Moreover, the proposed MKTB and GRB are plug-and-play modules; experiments on other tasks, such as video understanding and video-based person re-identification, further demonstrate their efficiency and generalization ability.