Abstract
Existing methods for recognizing actions in static images take the images at face value, learning the
appearances—objects, scenes, and body poses—that distinguish each action class. However, such models are deprived
of the rich dynamic structure and motions that also define
human activity. We propose an approach that hallucinates
the unobserved future motion implied by a single snapshot
to help static-image action recognition. The key idea is to
learn a prior over short-term dynamics from thousands of
unlabeled videos, infer the anticipated optical flow on novel
static images, and then train discriminative models that exploit both streams of information. Our main contributions
are twofold. First, we devise an encoder-decoder convolutional neural network and a novel optical flow encoding
that can translate a static image into an accurate flow map.
Second, we show the power of hallucinated flow for recognition, successfully transferring the learned motion into a
standard two-stream network for activity recognition. On
seven datasets, we demonstrate the power of the approach.
It not only achieves state-of-the-art accuracy for dense optical flow prediction, but also consistently enhances recognition of actions and dynamic scenes.