Abstract
In this paper, we tackle the problem of retrieving videos
using complex natural language queries. Towards this goal,
we first parse the sentential descriptions into a semantic
graph, which is then matched to visual concepts using a
generalized bipartite matching algorithm. Our approach
exploits object appearance, motion and spatial relations,
and learns the importance of each term using structured
prediction. We demonstrate the effectiveness of our approach
on a new dataset designed for semantic search in the context
of autonomous driving, which exhibits complex and highly
dynamic scenes with many objects. We show that our approach
is able to locate a major portion of the objects described
in the query with high accuracy, and to improve relevance
in video retrieval.