Abstract. This work is driven by the question of whether spatio-temporal
correlations are enough for 3D convolutional neural networks (CNNs). Most
traditional 3D networks use local spatio-temporal features. We introduce a
new block that models correlations between channels of a 3D CNN with respect
to temporal and spatial features. This new block can be added as a residual unit to
different parts of 3D CNNs. We name our novel block ‘Spatio-Temporal Channel Correlation’ (STC). By embedding this block into current state-of-the-art
architectures such as ResNeXt and ResNet, we improve performance by 2-3%
on the Kinetics dataset. Our experiments show that current state-of-the-art architectures with added STC blocks outperform the state-of-the-art methods on the
HMDB51, UCF101 and Kinetics datasets. A further issue with 3D CNNs
is that they are trained from scratch on a huge labeled dataset to reach reasonable performance, so the knowledge learned by 2D CNNs is completely ignored.
Another contribution of this work is a simple and effective technique to transfer
knowledge from a pre-trained 2D CNN to a randomly initialized 3D CNN for a
stable weight initialization. This allows us to significantly reduce the number of
training samples needed for 3D CNNs. By fine-tuning this network, we outperform
generic and recent 3D CNN methods that were trained on
large video datasets, e.g. Sports-1M, and fine-tuned on the target datasets, e.g.
HMDB51/UCF101.