Abstract. The proliferation of video sharing services brings new challenges
to video retrieval, e.g., the rapid growth in video duration and content
diversity. Meeting such challenges calls for new techniques that can effectively retrieve videos with natural language queries. Existing methods
along this line, which mostly rely on embedding videos as a whole, remain far from satisfactory for real-world applications due to their limited
expressive power. In this work, we aim to move beyond this limitation
by delving into the internal structures of both sides, the queries and the
videos. Specifically, we propose a new framework called Find and Focus (FIFO), which not only performs top-level matching (paragraph vs.
video), but also makes part-level associations, localizing a video clip for
each sentence in the query with the help of a focusing guide. These levels are complementary – the top-level matching narrows the search while
the part-level localization refines the results. On both ActivityNet Captions and modified LSMDC datasets, the proposed framework achieves
remarkable performance gains.