Abstract
Dense video captioning is an emerging task that
aims at both localizing and describing all events in a video.
We identify and tackle two challenges of this task, namely,
(1) how to utilize both past and future contexts for accurate
event proposal predictions, and (2) how to construct informative input to the decoder for generating natural event descriptions. First, previous methods predominantly generate
temporal event proposals in the forward direction, which
neglects future video context. We propose a bidirectional
proposal method that effectively exploits both past and future contexts to make proposal predictions. Second, different events ending at (nearly) the same time are indistinguishable to previous methods, resulting in identical
captions. We solve this problem by representing each event
with an attentive fusion of hidden states from the proposal
module and video contents (e.g., C3D features). We further propose a novel context gating mechanism to balance
the contributions from the current event and its surrounding
contexts dynamically. We empirically show that our attentively fused event representation is superior to either the proposal
hidden states or the video contents alone. By coupling proposal
and captioning modules into one unified framework, our
model outperforms the state of the art on the ActivityNet
Captions dataset with a relative gain of over 100% (the METEOR
score increases from 4.82 to 9.65).
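
To make the three ideas in the abstract concrete, the following is a minimal sketch of a bidirectional proposal encoder, attentive fusion of proposal hidden states with video features, and a context gate that balances the current event against its surrounding context. This is not the authors' implementation: the module name, layer sizes, the feature dimension of 500, and the mean-pooled stand-in for "surrounding context" are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiProposalAttentiveFusion(nn.Module):
    """Sketch: (i) bidirectional proposal encoding, (ii) attentive fusion,
    (iii) context gating, producing the caption decoder's input vector."""

    def __init__(self, video_dim=500, hidden_dim=512):
        super().__init__()
        # (i) Bidirectional encoder over clip features: the forward pass
        # summarizes past context, the backward pass future context.
        self.encoder = nn.GRU(video_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        # (ii) Scores each in-event clip feature against the proposal state.
        self.attn = nn.Linear(video_dim + 2 * hidden_dim, 1)
        self.video_proj = nn.Linear(video_dim, 2 * hidden_dim)
        # (iii) Produces a gate in (0, 1) over the fused event feature.
        self.gate = nn.Linear(4 * hidden_dim, 2 * hidden_dim)

    def forward(self, video_feats, start, end):
        # video_feats: (T, video_dim) clip-level features (e.g., C3D).
        states, _ = self.encoder(video_feats.unsqueeze(0))  # (1, T, 2H)
        states = states.squeeze(0)                          # (T, 2H)
        # Proposal state at the event's end boundary; because the encoder is
        # bidirectional it encodes both past and future video context.
        proposal_h = states[end - 1]                        # (2H,)
        # Crude surrounding-context vector: mean over the whole video.
        context = states.mean(dim=0)                        # (2H,)
        # Attentive fusion over the clips inside [start, end): events that
        # end at the same time still attend to different content.
        seg = video_feats[start:end]                        # (L, video_dim)
        h = proposal_h.expand(seg.size(0), -1)              # (L, 2H)
        alpha = F.softmax(self.attn(torch.cat([seg, h], dim=1)), dim=0)
        event = self.video_proj((alpha * seg).sum(dim=0))   # fused event feature
        # Context gating: a sigmoid gate dynamically balances the current
        # event's feature against the surrounding context.
        g = torch.sigmoid(self.gate(torch.cat([event, context], dim=0)))
        return g * event + (1.0 - g) * context              # decoder input

# Hypothetical usage: 120 clips of 500-d features, event spanning clips 30-60.
feats = torch.randn(120, 500)
decoder_input = BiProposalAttentiveFusion()(feats, start=30, end=60)
```

Since the gate interpolates elementwise between the event feature and its context, the decoder can lean on surrounding video when the event segment alone is ambiguous, which is the balancing behavior the abstract attributes to context gating.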