Abstract
Action recognition in still images has recently been advanced by deep learning. However, the success of these deep
models heavily depends on large amounts of training images
for various action categories, which may not be available
in practice. In contrast, humans can classify new action
categories after seeing only a few images, since we not only
compare appearance similarities between the images at hand,
but also recall important motion cues from relevant action videos in our memory. To mimic this capacity,
we propose a novel Hybrid Video Memory (HVM) machine,
which can hallucinate temporal features of still images from
video memory, in order to boost action recognition with only a few
still images. First, we design a temporal memory module
consisting of temporal hallucinating and predicting. Temporal hallucinating can generate temporal features of still
images in an unsupervised manner. Hence, it can be flexibly used in realistic scenarios, where image and video categories may not be consistent. Temporal predicting can
effectively infer the action category of a query image by integrating temporal features of training images and videos
in a domain-adaptation manner. Second, we design a
spatial memory module for spatial predicting. As spatial
and temporal features are complementary in representing different actions, we apply spatial-temporal prediction fusion
to further boost performance. Finally, we design a video selection module to select strongly relevant videos as memory. In this way, we can balance the number of images and
videos to reduce prediction bias while preserving computational efficiency. To show its effectiveness, we conduct
extensive experiments on three challenging data sets, where
our HVM outperforms a number of recent approaches by
hallucinating temporal features from video memory.
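
As a rough, minimal sketch of the core idea (the abstract gives no implementation details, so the interface, dimensions, retrieval scheme, and fusion weight lam below are all illustrative assumptions, not the paper's actual method), the snippet hallucinates a temporal feature for a still image by softly attending over a memory bank of video features, then fuses spatial and temporal class posteriors by a weighted sum:

```python
import numpy as np

rng = np.random.default_rng(0)

def hallucinate_temporal(image_feat, video_spatial_bank, video_temporal_bank, k=5):
    """Hallucinate a temporal feature for a still image (unsupervised:
    only feature similarity is used, never video labels).
    NOTE: retrieval-plus-soft-averaging is an assumed mechanism."""
    # Cosine similarity between the image's spatial feature and the
    # spatial features of the memorized videos.
    sims = video_spatial_bank @ image_feat
    sims = sims / (np.linalg.norm(video_spatial_bank, axis=1)
                   * np.linalg.norm(image_feat) + 1e-8)
    top = np.argsort(sims)[-k:]                  # k most similar videos
    w = np.exp(sims[top] - sims[top].max())      # numerically stable softmax
    w = w / w.sum()
    # Weighted average of the retrieved temporal features.
    return w @ video_temporal_bank[top]

def fuse_predictions(p_spatial, p_temporal, lam=0.5):
    """Late fusion of spatial and temporal class posteriors;
    lam is an assumed fusion weight, not a value from the paper."""
    return lam * p_spatial + (1.0 - lam) * p_temporal

# Tiny demo on random features, purely to show the data flow.
img = rng.normal(size=128)                  # spatial feature of a still image
spatial_bank = rng.normal(size=(100, 128))  # spatial features of 100 memory videos
temporal_bank = rng.normal(size=(100, 64))  # matching temporal features
t_feat = hallucinate_temporal(img, spatial_bank, temporal_bank)
print(t_feat.shape)                         # (64,)
```

In the paper, the spatial and temporal posteriors would come from learned classifiers over the corresponding features; the fusion function above only illustrates the late-fusion form implied by "spatial-temporal prediction fusion."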