Abstract
We propose a unified framework DISCOVER to simultaneously discover important segments, classify high-level events and generate recounting for large amounts of unconstrained web videos. The motivation is our observation that many video events are characterized by certain important segments. Our goal is to find the important segments and capture their information for event classification and recounting. We introduce an evidence localization model where evidence locations are modeled as latent variables. We impose constraints on global video appearance, local evidence appearance and the temporal structure of the evidence. The model is learned via a max-margin framework and allows efficient inference. Our method does not require annotating sources of evidence, and is jointly optimized for event classification and recounting. Experimental results are shown on the challenging TRECVID 2013 MEDTest dataset.
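To make the idea of latent evidence locations with a temporal constraint concrete, the following is a minimal sketch and not the model defined in this paper. It assumes a video is split into T segments, that K pieces of evidence must be placed at strictly increasing segment indices, and that a per-segment score for each piece of evidence is already available. The function names, the score layout and the strict-ordering constraint are all illustrative assumptions. Under those assumptions, the best placement can be found by dynamic programming in O(KT) time.

```python
from typing import List, Tuple


def best_ordered_evidence(local_scores: List[List[float]]) -> Tuple[float, List[int]]:
    """Pick one segment per evidence so that indices strictly increase.

    local_scores[k][t] is the (hypothetical) score of placing evidence k
    at segment t. Returns the best total score and the chosen segments.
    """
    K = len(local_scores)
    if K == 0:
        return 0.0, []
    T = len(local_scores[0])
    NEG = float("-inf")

    # best[k][t]: best total with evidence 0..k placed and evidence k at t.
    best = [[NEG] * T for _ in range(K)]
    back = [[-1] * T for _ in range(K)]
    best[0] = list(local_scores[0])

    for k in range(1, K):
        run_max, run_arg = NEG, -1  # max of best[k-1][t'] over t' < t
        for t in range(T):
            if run_arg >= 0:
                best[k][t] = run_max + local_scores[k][t]
                back[k][t] = run_arg
            if best[k - 1][t] > run_max:
                run_max, run_arg = best[k - 1][t], t

    t_end = max(range(T), key=lambda t: best[K - 1][t])
    if best[K - 1][t_end] == NEG:
        raise ValueError("need at least as many segments as evidence slots")

    # Follow back-pointers to recover the latent locations.
    path = [0] * K
    t = t_end
    for k in range(K - 1, -1, -1):
        path[k] = t
        t = back[k][t]
    return best[K - 1][t_end], path


def video_score(global_score: float,
                local_scores: List[List[float]]) -> Tuple[float, List[int]]:
    """Assumed additive form: global appearance plus best ordered evidence."""
    evidence_score, locations = best_ordered_evidence(local_scores)
    return global_score + evidence_score, locations
```

In this sketch the recovered locations play the role of the recounting: they indicate which segments contributed most to the score. With `local_scores = [[1.0, 5.0, 0.0, 0.0], [4.0, 0.0, 2.0, 3.0]]`, the ordering constraint rules out pairing segment 1 with segment 0, so the selected segments are 1 and 3.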