Abstract. In this paper, we focus on exploring modality-temporal mutual information for RGB-D action recognition. To learn time-varying information and multi-modal features jointly, we propose a novel
deep bilinear learning framework. Within this framework, we design bilinear
blocks, each consisting of two linear pooling layers that pool the input
cube features along the modality and temporal directions, respectively.
To capture rich modality-temporal information and facilitate our deep
bilinear learning, a new action feature called the modality-temporal cube is
presented in a tensor structure for characterizing RGB-D actions from a
comprehensive perspective. Our method is extensively tested on two public datasets with four different evaluation settings, and the results show
that the proposed method outperforms the state-of-the-art approaches.
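
As a rough illustration of the bilinear block described above, the following is a minimal sketch, not the implementation used in the paper. It assumes the modality-temporal cube is stored as an array of shape (M, T, D), for M modalities, T temporal segments, and D feature dimensions, and that each pooling layer is a plain linear map without bias or nonlinearity. The names bilinear_block, w_modality, and w_temporal, the sizes, and the random weights are all hypothetical.

```python
import numpy as np


def bilinear_block(cube, w_modality, w_temporal):
    """Pool an (M, T, D) cube along the modality axis, then the temporal axis.

    Assumed shapes (illustrative only):
      cube:       (M, T, D)
      w_modality: (M, M_out)
      w_temporal: (T, T_out)
    Returns an array of shape (M_out, T_out, D).
    """
    # Modality pooling: linearly combine the M modality slices into M_out slices.
    pooled_m = np.einsum("mtd,mp->ptd", cube, w_modality)
    # Temporal pooling: linearly combine the T temporal slices into T_out slices.
    pooled_mt = np.einsum("ptd,tq->pqd", pooled_m, w_temporal)
    return pooled_mt


# Hypothetical sizes and random weights, used only to exercise the sketch.
rng = np.random.default_rng(0)
M, T, D = 3, 8, 16
cube = rng.standard_normal((M, T, D))
w_m = rng.standard_normal((M, 2))
w_t = rng.standard_normal((T, 4))

out = bilinear_block(cube, w_m, w_t)
print(out.shape)
```

Under these assumptions, each feature channel d of the output equals w_m.T @ cube[:, :, d] @ w_t, which is the bilinear form that the two pooling layers compute jointly.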