Abstract. Supervised learning techniques have made substantial progress
on video summarization. State-of-the-art approaches mostly regard the
predicted summary and the human summary as two sequences (sets),
and minimize discriminative losses that measure element-wise discrepancy. Such training objectives do not explicitly model how well the predicted summary preserves semantic information in the video. Moreover,
those methods often demand a large number of human-generated summaries. In this paper, we propose a novel sequence-to-sequence learning
model to address these deficiencies. The key idea is to complement the
discriminative losses with another loss that measures whether the predicted
summary preserves the same information as the original video. To this
end, we propose to augment standard sequence learning models with an
additional “retrospective encoder” that embeds the predicted summary
into an abstract semantic space. The embedding is then compared to
the embedding of the original video in the same space. The intuition is
that both embeddings ought to be close to each other for a video and its
corresponding summary. Thus our approach adds to the discriminative
loss a metric learning loss that minimizes the distance between such pairs
while maximizing the distances between unmatched ones. One important
advantage is that the metric learning loss readily allows learning from
videos without human-generated summaries. Extensive experimental results show that our model outperforms existing ones by a large margin
in both supervised and semi-supervised settings.
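
The following is a minimal sketch of how such a metric learning term could be implemented, assuming PyTorch. The function name, the margin value, and the use of other videos in the batch as unmatched pairs are illustrative assumptions, not the authors' exact formulation.

```python
# Illustrative sketch (not the authors' released code) of a metric learning
# loss that pulls a video embedding toward the embedding of its own
# predicted summary and pushes it away from summaries of other videos.
import torch
import torch.nn.functional as F


def retrospective_metric_loss(video_emb, summary_emb, margin=1.0):
    """video_emb:   (batch, dim) embeddings of the original videos
    summary_emb: (batch, dim) embeddings of the predicted summaries
    margin:      hinge margin for unmatched (video, summary) pairs (assumed)
    """
    # Squared Euclidean distances between every video and every summary.
    dist = torch.cdist(video_emb, summary_emb, p=2) ** 2     # (batch, batch)

    batch = video_emb.size(0)
    pos = dist.diag()                                         # matched pairs
    mask = ~torch.eye(batch, dtype=torch.bool, device=dist.device)
    neg = dist[mask].view(batch, batch - 1)                   # unmatched pairs

    # Minimize distances of matched pairs; hinge-penalize unmatched pairs
    # that fall inside the margin.
    return pos.mean() + F.relu(margin - neg).mean()


# Usage (hypothetical weighting): combine with the discriminative loss.
# total_loss = discriminative_loss + lambda_metric * retrospective_metric_loss(v, s)
```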