Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition with Multimodal Training
Abstract
We present an efficient approach for leveraging knowledge from multiple modalities when training unimodal 3D convolutional neural networks (3D-CNNs) for dynamic hand gesture recognition. Instead of explicitly combining multimodal information, as is commonplace in many state-of-the-art methods, we propose a framework that embeds the knowledge of multiple modalities into individual networks so that each unimodal network achieves improved performance. In particular, we dedicate a separate network to each available modality and train these networks to collaborate, so that they develop common semantics and better representations. We introduce a “spatiotemporal semantic alignment” (SSA) loss to align the content of the features from the different networks, and we regularize this loss with our proposed “focal regularization parameter” to avoid negative knowledge transfer. Experimental results show that our framework improves the test-time recognition accuracy of the unimodal networks and provides state-of-the-art performance on several dynamic hand gesture recognition datasets.
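To make the two ingredients named above concrete, the sketch below shows one plausible form of an SSA-style alignment term gated by a focal weight. It is a minimal illustration, not the paper's exact formulation: the channel normalization, the squared-distance alignment, and the focal weighting rho = exp(-beta * margin) are assumptions chosen only to reflect the stated intent of aligning feature content while suppressing negative transfer.

import torch
import torch.nn.functional as F

def ssa_loss(feat_target, feat_source, loss_target, loss_source, beta=2.0):
    # feat_*: (N, C, T, H, W) spatiotemporal features from two unimodal 3D-CNNs.
    # loss_*: scalar classification-loss tensors of the two networks on the same batch.
    f_t = F.normalize(feat_target.flatten(2), dim=1)   # (N, C, T*H*W), unit-norm channels
    f_s = F.normalize(feat_source.flatten(2), dim=1)
    # Alignment term: distance between the (normalized) feature content of the
    # two modality networks; the source features are detached so it acts as a target.
    align = (f_t - f_s.detach()).pow(2).sum(dim=1).mean()
    # Focal weight: shrink toward 0 when the source network performs worse
    # (higher loss) than the target, so knowledge transfers only in the helpful direction.
    margin = (loss_source - loss_target).clamp(min=0.0)
    rho = torch.exp(-beta * margin)
    return rho * align

In a training loop, this term would be added to each unimodal network's classification loss, letting the better-performing modality on a given batch guide the other without being dragged down by it.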