Abstract
Fine-grained activity understanding in videos has attracted considerable recent attention, with a shift from action classification to detailed actor and action understanding that provides compelling results for the perceptual needs of
cutting-edge autonomous systems. However, current methods for detailed actor and action understanding have significant limitations: they require large amounts of finely
labeled data, and they fail to capture the internal relationships among actors and actions. To address these issues, we propose a novel, robust multi-task
ranking model for weakly-supervised actor-action segmentation where only video-level tags are given for training
samples. Our model shares useful information
among different actors and actions while learning a ranking
matrix that selects representative supervoxels for actors and
actions respectively. Final segmentation results are generated by a conditional random field that considers various ranking scores for video parts. Extensive experimental results on the Actor-Action Dataset (A2D) demonstrate
that the proposed approach outperforms the state-of-the-art
weakly supervised methods and performs as well as the top-performing fully supervised method.