Abstract
We develop a unified framework for complex event retrieval, recognition, and recounting. The framework is based on a compact video representation that exploits the temporal correlations in image features. Our feature alignment procedure identifies and removes feature redundancies across frames and outputs an intermediate tensor representation that we call the video imprint. The video imprint is then fed
into a reasoning network, whose attention mechanism parallels that of memory networks used in language modeling.
The reasoning network simultaneously recognizes the event
category and locates the key pieces of evidence for event
recounting. In event retrieval tasks, we show that the compact video representation aggregated from the video imprint
achieves significantly better retrieval accuracy compared
with existing methods. We also set new state-of-the-art results in event recognition tasks, with an additional benefit: the latent structure in our reasoning network highlights the relevant regions of the video imprint and can be directly used for event recounting. Because the video imprint maps back to locations in the video frames, the network identifies not only the key frames but also the specific areas within each frame that are most influential to the decision process.