Abstract
Event detection in unconstrained videos is conceived
as a content-based video retrieval task with two modalities:
textual and visual. Given a text describing a novel event,
the goal is to rank related videos accordingly. This task is
zero-exemplar: no video examples of the novel event are
given.
Related works train a bank of concept detectors on external
data sources. These detectors predict confidence scores for
test videos, which are then ranked and retrieved accordingly. In
contrast, we learn a joint space in which the visual and
textual representations are embedded. The space casts a
novel event as a probability distribution over pre-defined events.
It also learns to measure the distance between an event and its
related videos.
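To make the idea concrete, the following is a minimal, illustrative sketch of such a joint space, not the paper's exact architecture: module names, feature dimensions, and the cosine distance are assumptions chosen for illustration.

```python
# Illustrative sketch (assumed architecture, dimensions, and distance):
# a text query is softly cast onto pre-defined events, embedded jointly
# with video features, and compared by a learned distance.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, text_dim=300, video_dim=2048, num_events=500, embed_dim=512):
        super().__init__()
        # Text branch: cast the text feature as a probability distribution
        # over pre-defined events, then embed that distribution.
        self.text_to_events = nn.Linear(text_dim, num_events)
        self.event_embed = nn.Linear(num_events, embed_dim)
        # Video branch: embed the video feature into the same joint space.
        self.video_embed = nn.Linear(video_dim, embed_dim)

    def forward(self, text_feat, video_feat):
        event_probs = F.softmax(self.text_to_events(text_feat), dim=-1)
        t = F.normalize(self.event_embed(event_probs), dim=-1)
        v = F.normalize(self.video_embed(video_feat), dim=-1)
        # Distance in the joint space (cosine distance assumed here).
        return 1.0 - (t * v).sum(dim=-1)

# Zero-exemplar retrieval: rank test videos by distance to the novel event's text.
model = JointEmbedding()
text = torch.randn(1, 300)        # feature of the novel event description
videos = torch.randn(100, 2048)   # features of 100 test videos
dists = model(text.expand(100, -1), videos)
ranking = dists.argsort()         # closest videos first
```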
Our model is trained end-to-end on the publicly available
EventNet dataset. When applied to the TRECVID Multimedia Event
Detection dataset, it outperforms the state-of-the-art by a
considerable margin.