Abstract
We study the task of directly modelling a visually intelligent agent. Computer vision typically focuses on solving various subtasks related to visual intelligence; we depart from this standard approach and instead model the agent as a whole. Our model takes
visual information as input and directly predicts the actions
of the agent. Toward this end we introduce DECADE, a
dataset of ego-centric videos from a dog’s perspective as
well as her corresponding movements. Using this data we
model how the dog acts and how she plans her movements. We show, under a variety of metrics, that given only visual input we can successfully model this intelligent agent in many situations. Moreover, the representation learned by
our model encodes distinct information compared to representations trained on image classification, and our learned
representation can generalize to other domains. In particular, we show strong results on the task of walkable surface
estimation and scene classification by using this dog modelling task as representation learning