Abstract
We address the problem of categorizing turn-taking interactions between individuals. Social interactions are characterized by turn-taking and arise frequently in real-world videos. Our approach uses temporal causal analysis to decompose a space-time visual word representation of video into co-occurring independent segments, called causal sets [1]. These causal sets then serve as the input to a multiple instance learning framework that categorizes turn-taking interactions. We introduce a new turn-taking interactions dataset consisting of social games and sports rallies. We demonstrate that our formulation of multiple instance learning (QP-MISVM) is better able to leverage the repetitive structure in turn-taking interactions and achieves superior performance relative to a conventional bag-of-words model.