Abstract
This paper contributes to the automatic classification and localization of human actions in video. Whereas motion is the key ingredient in modern approaches, we assess the benefits of including objects in the video representation. Rather than considering a handful of carefully selected and localized objects, we conduct an empirical study on the benefit of encoding 15,000 object categories for action, using 6 datasets totaling more than 200 hours of video and covering 180 action classes. Our key contributions are: i) the first in-depth study of encoding objects for actions; ii) we show that objects matter for actions, and are often semantically relevant as well; iii) we establish that actions have object preferences: rather than using all objects, selection is advantageous for action recognition; iv) we reveal that object-action relations are generic, which allows transferring these relationships from one domain to another; and v) objects, when combined with motion, improve the state-of-the-art for both action classification and localization.