Abstract
Robust perception-action models should be learned from
training data with diverse visual appearances and realistic behaviors, yet current approaches to deep visuomotor
policy learning have generally been limited to in-situ models learned from a single vehicle or simulation environment.
We advocate learning a generic vehicle motion model from
large-scale crowd-sourced video data, and develop an end-to-end trainable architecture for learning to predict a distribution over future vehicle egomotion from instantaneous
monocular camera observations and previous vehicle state.
Our model incorporates a novel FCN-LSTM architecture,
which can be learned from large-scale crowd-sourced vehicle action data, and leverages available scene segmentation side tasks to improve performance under a privileged
learning paradigm. We provide a novel large-scale dataset
of crowd-sourced driving behavior suitable for training our
model, and report results predicting the driver action on
held-out sequences across diverse conditions.
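A minimal sketch of the kind of architecture the abstract describes, assuming a PyTorch implementation: a fully-convolutional image encoder whose pooled features are concatenated with the previous vehicle state and passed to an LSTM that outputs a distribution over discrete future egomotion actions. The backbone choice (torchvision ResNet-18), layer sizes, state dimension, and four-way action set are illustrative assumptions, not the authors' released model.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FCNLSTM(nn.Module):
    """Illustrative FCN-LSTM: image features + previous state -> action distribution."""
    def __init__(self, num_actions=4, state_dim=2, hidden_dim=64):
        super().__init__()
        # Fully-convolutional encoder (assumed: ResNet-18 with pooling/classifier removed).
        backbone = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Recurrence over time; input = image feature concatenated with previous vehicle state.
        self.lstm = nn.LSTM(512 + state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, frames, prev_state):
        # frames: B x T x 3 x H x W, prev_state: B x T x state_dim
        B, T = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1))           # (B*T) x 512 x H' x W'
        feats = self.pool(feats).flatten(1).view(B, T, -1)    # B x T x 512
        out, _ = self.lstm(torch.cat([feats, prev_state], dim=-1))
        return self.head(out).log_softmax(dim=-1)             # log-probs over actions per step

# Usage: predict the action distribution for a 4-frame clip.
model = FCNLSTM()
logp = model(torch.randn(1, 4, 3, 224, 224), torch.zeros(1, 4, 2))
print(logp.shape)  # torch.Size([1, 4, 4])
```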