Abstract. Recent work has demonstrated that it is possible to learn
deep neural networks for monocular depth and ego-motion estimation
from unlabelled video sequences, an interesting theoretical development
with numerous advantages in applications. In this paper, we propose
a number of improvements to these approaches. First, since such self-supervised approaches are based on the brightness constancy assumption, which is valid only for a subset of pixels, we propose a probabilistic
learning formulation where the network predicts distributions over variables rather than specific values. As these distributions are conditioned
on the observed image, the network can learn which scene and object
types are likely to violate the model assumptions, resulting in more robust learning. We also propose to build on decades of experience
in developing handcrafted structure-from-motion (SFM) algorithms. We
do so by using an off-the-shelf SFM system to generate a supervisory
signal for the deep neural network. While this signal is also noisy, we
show that our probabilistic formulation can learn and account for the
defects of SFM, helping to integrate different sources of information and
boosting the overall performance of the network.
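As an illustration of the probabilistic formulation described above, a common way to realize a per-pixel predictive distribution is to have the network output a scale (uncertainty) alongside each prediction and minimize the negative log-likelihood, e.g. under a Laplace distribution on the photometric residual. The sketch below is illustrative rather than the paper's exact formulation; the function name `laplace_nll` and the numeric values are assumptions for demonstration.

```python
import numpy as np

def laplace_nll(pred, target, log_b):
    """Per-pixel negative log-likelihood under a Laplace distribution.

    pred, target: image intensities; log_b: predicted log-scale
    (uncertainty). Pixels likely to violate brightness constancy
    (e.g. specular or moving objects) can be down-weighted by the
    network predicting a larger scale b for them.
    """
    b = np.exp(log_b)
    return np.abs(pred - target) / b + np.log(2.0 * b)

# The same photometric residual incurs a smaller loss when the
# predicted uncertainty is higher, so confident-but-wrong pixels
# are penalized more than pixels flagged as unreliable.
loss_low_uncertainty = laplace_nll(0.8, 0.2, np.log(0.1))
loss_high_uncertainty = laplace_nll(0.8, 0.2, np.log(1.0))
```

The `log(2b)` term prevents the degenerate solution of predicting infinite uncertainty everywhere, since it grows as the scale grows.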