Abstract
Given the recent advances in depth prediction from Convolutional Neural Networks (CNNs), this paper investigates
how predicted depth maps from a deep neural network can
be deployed for accurate and dense monocular reconstruction. We propose a method where CNN-predicted dense
depth maps are fused with depth measurements obtained from direct monocular SLAM. Our fusion scheme favors depth prediction in image locations
where monocular SLAM approaches tend to fail, e.g., in
low-texture regions, and vice versa. We demonstrate the
use of depth prediction for estimating the absolute scale of
the reconstruction, hence overcoming one of the major limitations of monocular SLAM. Finally, we propose a framework to efficiently fuse semantic labels, obtained from a single frame, with dense SLAM, yielding semantically coherent
scene reconstruction from a single view. Evaluation results
on two benchmark datasets show the robustness and accuracy of our approach.
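
To make the fusion idea concrete, the following is a minimal sketch of per-pixel depth fusion, assuming each depth source comes with a variance estimate and that the two are combined by inverse-variance weighting. The function name `fuse_depth`, the flat-list representation, and the weighting rule itself are illustrative assumptions, not the formulation used in this paper. A SLAM variance of infinity stands in for a pixel where SLAM yields no usable measurement, such as a low-texture region.

```python
def fuse_depth(d_cnn, var_cnn, d_slam, var_slam):
    """Fuse two per-pixel depth estimates by inverse-variance weighting.

    All arguments are flat lists of equal length (hypothetical layout).
    CNN variances must be finite and positive. A SLAM variance of
    float("inf") marks a pixel with no usable SLAM measurement.
    """
    fused = []
    for dc, vc, ds, vs in zip(d_cnn, var_cnn, d_slam, var_slam):
        w_cnn = 1.0 / vc
        # An infinite variance gives the SLAM estimate zero weight.
        w_slam = 0.0 if vs == float("inf") else 1.0 / vs
        fused.append((w_cnn * dc + w_slam * ds) / (w_cnn + w_slam))
    return fused


# Pixel 0: textured, SLAM is more certain and dominates the result.
# Pixel 1: low texture, no SLAM measurement, so the CNN depth is kept.
fused = fuse_depth(
    d_cnn=[2.0, 3.0],
    var_cnn=[0.5, 0.5],
    d_slam=[2.5, 9.0],
    var_slam=[0.125, float("inf")],
)
```

Under these assumptions, the fused depth at the textured pixel lies closer to the SLAM value, while the low-texture pixel falls back entirely on the predicted depth.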