Abstract
The emergence of wearable devices such as portable cameras and smart glasses makes it possible to record lifelogging first-person videos. Browsing such long, unstructured videos is time-consuming and tedious. This paper studies the discovery of moments of a user’s major or special interest (i.e., highlights) in a video, in order to generate summaries of first-person videos. Specifically, we propose a novel pairwise deep ranking model that employs deep learning techniques to learn the relationship between highlight and non-highlight video segments. For video highlight detection, we develop a two-stream network structure that represents video segments using complementary information: the appearance of video frames and the temporal dynamics across frames. Given a long personal video, the highlight detection model assigns a highlight score to each segment. The resulting highlight segments are used for summarization in two ways: video timelapse and video skimming. The former plays highlight segments at low speed and non-highlight segments at high speed, while the latter assembles the sequence of segments with the highest scores. On 100 hours of first-person videos covering 15 unique sports categories, our highlight detection improves accuracy by 10.5% over the state-of-the-art RankSVM method. Moreover, a user study with 35 human subjects shows that our approaches produce video summaries of better quality.
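As a rough illustration of the ideas summarized above, the following minimal Python sketch shows one common way to express a pairwise ranking objective and the two summarization modes. It is a sketch under stated assumptions, not the implementation used in this work: the hinge form of the loss, the margin value, the score threshold, the playback rates, and all function names (`pairwise_ranking_loss`, `skim`, `timelapse_rates`) are hypothetical choices made for illustration, and the segment scores are assumed to come from an already trained highlight detection model.

```python
from typing import List, Sequence


def pairwise_ranking_loss(highlight_score: float,
                          non_highlight_score: float,
                          margin: float = 1.0) -> float:
    """Hinge-style pairwise loss (assumed form, margin is hypothetical).

    The loss is zero once the highlight segment outscores the
    non-highlight segment by at least `margin`.
    """
    return max(0.0, margin - highlight_score + non_highlight_score)


def skim(scores: Sequence[float], k: int) -> List[int]:
    """Video skimming: pick the k highest-scoring segments.

    Indices are returned in temporal order so the selected segments
    can be assembled into a sequence.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def timelapse_rates(scores: Sequence[float],
                    threshold: float,
                    slow: float = 1.0,
                    fast: float = 8.0) -> List[float]:
    """Video timelapse: one playback rate per segment.

    Segments scoring at or above `threshold` (hypothetical) play at the
    slow rate; all other segments play at the fast rate.
    """
    return [slow if s >= threshold else fast for s in scores]


if __name__ == "__main__":
    segment_scores = [0.1, 0.9, 0.3, 0.8]  # made-up scores for four segments
    print(pairwise_ranking_loss(0.25, 0.75))
    print(skim(segment_scores, k=2))
    print(timelapse_rates(segment_scores, threshold=0.5))
```

In this sketch, a training pair contributes no loss once the highlight segment is ranked sufficiently above the non-highlight segment, which is the property a pairwise ranking model is trained to achieve.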