Abstract
Given a static scene, a human can trivially enumerate the myriad of things that can happen next and characterize the relative likelihood of each. In the process, we make use of enormous amounts of commonsense knowledge about how the world works. In this paper, we investigate learning this commonsense knowledge from data. To overcome a lack of densely annotated spatiotemporal data, we learn from sequences of abstract images gathered using crowdsourcing. The abstract scenes provide both object location and attribute information. We demonstrate qualitatively and quantitatively that our models produce plausible scene predictions on both abstract images and natural images taken from the Internet.