Abstract
Understanding human actions is a key problem in computer vision. However, recognizing actions is only the first step of understanding what a person is doing. In this paper, we introduce the problem of predicting why a person has performed an action in images. This problem has many applications in human activity understanding, such as anticipating or explaining an action. To study this problem, we introduce a new dataset of people performing actions annotated with likely motivations. However, the information in an image alone may not be sufficient to automatically solve this task. Since humans can rely on their lifetime of experiences to infer motivation, we propose to give computer vision systems access to some of these experiences by using recently developed natural language models to mine knowledge stored in massive amounts of text. While we are still far away from fully understanding motivation, our results suggest that transferring knowledge from language into vision can help machines understand why people in images might be performing an action.