Abstract. The world of human-object interactions is rich. While we generally sit on chairs and sofas, if need be we can even sit on TVs or on top of
shelves. In recent years, there has been progress in modeling actions and
human-object interactions. However, most of these approaches require
large amounts of data, and it is unclear whether the learned representations of actions are
generalizable to new categories. In this paper, we explore the problem of
zero-shot learning of human-object interactions. Given limited verb-noun
interactions in training data, we want to learn a model that can work
even on unseen combinations. To address this problem,
we propose a novel method using an external knowledge graph and graph
convolutional networks that learns how to compose classifiers for verb-noun pairs. We also provide benchmarks on several datasets for zero-shot
learning, covering both images and videos. We hope our method, dataset,
and baselines will facilitate future research in this direction.