MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment
Abstract
This research addresses natural language moment retrieval in long, untrimmed video streams. The problem is
non-trivial, especially when a video contains multiple moments of interest and the language describes complex temporal dependencies, as often happens in real scenarios. We identify two crucial challenges: semantic misalignment and structural misalignment. Existing approaches, however, treat different moments separately and do not explicitly model complex moment-wise temporal relations. In
this paper, we present Moment Alignment Network (MAN),
a novel framework that unifies the candidate moment encoding and temporal structural reasoning in a single-shot
feed-forward network. MAN naturally assigns candidate
moment representations aligned with language semantics
over different temporal locations and scales. Most importantly, we propose to explicitly model moment-wise temporal relations as a structured graph and devise an iterative
graph adjustment network to jointly learn the best structure in an end-to-end manner. We evaluate the proposed
approach on two challenging public benchmarks, DiDeMo
and Charades-STA, where MAN outperforms the state of the art by a large margin.
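
To make the idea of iterative graph adjustment concrete, the following is a minimal sketch and not the architecture used in MAN. It assumes candidate moments are rows of a feature matrix, that the graph is a row-normalized adjacency matrix, and that each step first adjusts the adjacency with a feature-similarity residual and then applies one residual graph-convolution layer. The function names, the softmax-style similarity, and the update rule are all illustrative assumptions.

```python
import numpy as np


def row_normalize(a):
    """Scale each row of a non-negative matrix so that it sums to 1."""
    return a / a.sum(axis=1, keepdims=True)


def iterative_graph_adjustment(x, a0, weights):
    """Hypothetical sketch: alternate graph adjustment and graph convolution.

    x       : (n, d) candidate moment features (one row per moment)
    a0      : (n, n) initial non-negative adjacency, e.g. from temporal overlap
    weights : list of (d, d) matrices, one per adjustment step (assumed learned)
    """
    a = row_normalize(a0)
    for w in weights:
        # Adjust the graph: add a similarity-based residual to the current edges.
        sim = x @ x.T / np.sqrt(x.shape[1])
        sim = np.exp(sim - sim.max(axis=1, keepdims=True))  # numerically stable
        a = row_normalize(a + row_normalize(sim))
        # Propagate features over the adjusted graph, with a residual connection.
        x = x + np.maximum(a @ x @ w, 0.0)
    return x, a


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(5, 8))          # 5 candidate moments, 8-dim features
    steps = [0.1 * rng.normal(size=(8, 8)) for _ in range(2)]
    out, adj = iterative_graph_adjustment(feats, np.eye(5), steps)
    print(out.shape, adj.shape)
```

In this sketch the adjacency is refined at every step rather than fixed in advance, which is the property the abstract attributes to the iterative graph adjustment network; in a trained model the weight matrices would be learned end to end together with the moment encoder.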