Abstract
Video summarization is a challenging problem in part because knowing which part of a video is important requires prior knowledge about its main topic. We present TVSum, an unsupervised video summarization framework that uses title-based image search results to find visually important shots. We observe that a video title is often carefully chosen to be maximally descriptive of its main topic, and hence images related to the title can serve as a proxy for important visual concepts of the main topic. However, because titles are free-formed, unconstrained, and often written ambiguously, images searched using the title can contain noise (images irrelevant to video content) and variance (images of different topics). To deal with this challenge, we developed a novel co-archetypal analysis technique that learns canonical visual concepts shared between video and images, but not in either alone, by finding a joint-factorial representation of two data sets. We introduce a new benchmark dataset, TVSum50, that contains 50 videos and their shot-level importance scores annotated via crowdsourcing. Experimental results on two datasets, SumMe and TVSum50, suggest our approach produces superior-quality summaries compared to several recently proposed approaches.
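As background for the co-archetypal analysis mentioned above: in standard (single-dataset) archetypal analysis, a data matrix X (d x n) is factorized as X ~= (X B) A, where the archetypes X B are convex combinations of data points and each point is reconstructed as a convex combination of archetypes, i.e. the columns of both B and A lie on the probability simplex. The co-archetypal extension described in the paper couples two datasets; that extension is not reproduced here. Below is a minimal sketch of the plain, single-dataset version using alternating projected gradient steps. All function names, the learning rate, and the optimization scheme are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of a vector onto the probability simplex
    # (sort-based algorithm; ensures entries are >= 0 and sum to 1).
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, v.size + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def archetypal_analysis(X, k, n_iter=300, lr=1e-3, seed=0):
    # Factorize X (d x n) as X ~= (X @ B) @ A, with the columns of
    # B (n x k) and A (k x n) constrained to the probability simplex.
    # Alternating projected-gradient descent on 0.5 * ||X - X B A||^2.
    d, n = X.shape
    rng = np.random.default_rng(seed)
    B = np.apply_along_axis(project_simplex, 0, rng.random((n, k)))
    A = np.apply_along_axis(project_simplex, 0, rng.random((k, n)))
    losses = []
    for _ in range(n_iter):
        Z = X @ B                    # current archetypes (d x k)
        R = Z @ A - X                # reconstruction residual (d x n)
        losses.append(0.5 * np.sum(R ** 2))
        # gradient step in A, then re-project columns onto the simplex
        A = np.apply_along_axis(project_simplex, 0, A - lr * (Z.T @ R))
        # recompute residual, then gradient step in B
        R = X @ B @ A - X
        B = np.apply_along_axis(project_simplex, 0, B - lr * (X.T @ R @ A.T))
    return B, A, losses
```

The simplex constraints are what make the learned archetypes interpretable as "canonical" patterns: each archetype is a convex mixture of actual data points, and each point is explained as a mixture of archetypes.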