Abstract
Understanding narrated instructional videos is
important for both research and real-world
web applications. Motivated by video dense
captioning, we propose a model that generates procedure captions, i.e., a sequence of stepwise clips each paired with a description, from narrated instructional videos. Previous works
on video dense captioning learn video segments and generate captions without considering transcripts. We argue that transcripts
in narrated instructional videos can enhance
video representation by providing fine-grained
complementary and semantic textual information. In this paper, we introduce a framework
to (1) extract procedures by a cross-modality
module, which fuses video content with the
entire transcript; and (2) generate captions by
encoding video frames as well as a snippet
of transcripts within each extracted procedure.
Experiments show that our model can achieve
state-of-the-art performance in procedure extraction and captioning, and the ablation studies demonstrate that both the video frames and
the transcripts are important for the task.