Abstract
Deep models are state-of-the-art for many vision tasks
including video action recognition and video captioning.
Models are trained to caption or classify activity in videos,
but little is known about the evidence used to make such decisions. Grounding decisions made by deep networks has
been studied in spatial visual content, giving more insight
into model predictions for images. However, such studies
are relatively scarce for models of spatiotemporal visual
content, namely videos. In this work, we devise a formulation that
simultaneously grounds evidence in space and time, in a
single pass, using top-down saliency. We visualize the spatiotemporal cues that contribute to a deep model’s classification/captioning
output using the model’s internal representation. Based on these spatiotemporal cues, we are able
to localize segments within a video that correspond to a
specific action, or to a phrase from a caption, without explicitly
optimizing or training for these tasks.