Abstract. Most activity localization methods in the literature require frame-wise
annotations, which are burdensome to obtain. Learning from weak labels is a
potential way to reduce this manual labeling effort. Recent years have
witnessed a substantial influx of tagged videos on the Internet, which can serve
as a rich source of weakly-supervised training data. Specifically, the correlations
between videos with similar tags can be utilized to temporally localize the activities.
Towards this goal, we present W-TALC, a Weakly-supervised Temporal Activity
Localization and Classification framework using only video-level labels. The
proposed network can be divided into two sub-networks: a Two-Stream-based
feature extractor network and a weakly-supervised module, which we learn
by optimizing two complementary loss functions. Qualitative and quantitative
results on two challenging datasets, Thumos14 and ActivityNet1.2, demonstrate
that the proposed method is able to detect activities at a fine granularity and achieve
better performance than current state-of-the-art methods.