Abstract
In this paper, we are interested in modeling complex activities that occur in a typical household. We propose to use
programs, i.e., sequences of atomic actions and interactions,
as a high level representation of complex tasks. Programs
are interesting because they provide an unambiguous representation of a task and allow agents to execute them.
However, no existing database provides this type
of information. To fill this gap, we first crowdsource
programs for a variety of activities that happen in people’s
homes, via a game-like interface used for teaching kids how
to code. Using the collected dataset, we show how we can
learn to extract programs directly from natural language
descriptions or from videos. We then implement the most
common atomic (inter)actions in the Unity3D game engine,
and use our programs to “drive” an artificial agent to execute tasks in a simulated household environment. Our
VirtualHome simulator allows us to create a large activity
video dataset with rich ground-truth, enabling training and
testing of video understanding models. We further showcase
examples of our agent performing tasks in VirtualHome
based on language descriptions.
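
To make the notion of a program as a sequence of atomic actions and interactions concrete, the following is a minimal sketch of one possible in-memory representation. The `Step` class, the `render` helper, the bracketed text layout, and the "watch TV" example are all hypothetical and chosen for illustration only; they are not taken from the collected dataset or from the simulator's interface.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    """One atomic action or interaction (hypothetical schema)."""
    action: str
    objects: Tuple[str, ...] = ()  # objects the action is applied to, if any


def render(program: List[Step]) -> List[str]:
    """Render each step as one line of text, preserving execution order."""
    lines = []
    for step in program:
        args = " ".join(f"<{name}>" for name in step.objects)
        # strip() removes the trailing space when a step has no objects
        lines.append(f"[{step.action}] {args}".strip())
    return lines


# Invented example program for the activity "watch TV".
watch_tv = [
    Step("walk", ("television",)),
    Step("switch_on", ("television",)),
    Step("walk", ("sofa",)),
    Step("sit", ("sofa",)),
    Step("watch", ("television",)),
]

if __name__ == "__main__":
    print("\n".join(render(watch_tv)))
```

Because the order of the list is the order of execution, an agent could in principle consume such a program one step at a time, which is what makes the representation unambiguous compared with a free-form description.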